PREFACE


THE past twenty years or more have been a period of extremely
rapid and significant development in statistical theory and prac- 
tice. Yet, while many of the recent contributions — particularly 
those of R. A. Fisher and his students — appear to have almost 
revolutionary significance for educational research, research work- 
ers in this field have in general failed to recognize their amazing 
possibilities, or at any rate have not widely realized these possibil- 
ities in practice. 

A part of this neglect has been due to the mistaken notion that 
it is seldom necessary to use “small” samples in educational
research, that most of our samples consist of large numbers of
pupils or of individual observations, and that, hence, “small
sample” theory can be of relatively little practical interest or
value to research students in education. In taking this attitude, 
we have overlooked the very significant fact that most of our sam- 
ples, however large in terms of numbers of individual observations, 
are not simple random samples, but consist of relatively homogene- 
ous and intact subgroups, such as the pupils in a single school or 
under a single teacher. The number of these subgroups, further- 
more, is usually indeed small, and it is only through the use of small 
sample theory that we can accurately evaluate the results obtained. 

Perhaps a more telling reason, however, for this continued neg- 
lect is that the only expositions of these techniques that have thus 
far been readily available, particularly Fisher’s Statistical Methods 
for Research Workers and The Design of Experiments, have proved 
inordinately difficult for students in education to comprehend. 
This fact is due in part to the unfamiliar statistical notation and 
terminology employed; in part to the frequent and wide gaps in the 
sequence of logic which are left to the reader to fill but which can
be readily supplied only by a reader with advanced mathematical 
training; and perhaps most of all to the fact that all illustrations 
given are in the field of agricultural experimentation and are
concerned with “plots,” “blocks,” “yields,” “treatments,” etc.,
rather than with “schools,” “classes,” “scores,” “methods,”
“pupils,” etc.

The writer's primary purpose in this book, accordingly, has been
to translate Fisher's expositions into a language and notation 
familiar to the student of education; to clarify the exposition 
further by presenting all steps in the logic, in a manner such that 
they may be followed by students with little mathematical train- 
ing; and to point out specifically and illustrate concretely what 
seem to be the most promising applications of Fisher’s methods in 
educational research. Particular emphasis has been placed upon 
the importance of more careful design in educational research. 
Many of the difficulties that have been met by educational research 
workers in the analysis of their results have arisen from their
tendency to plan their experiments and investigations with little
direct regard to the methods of analysis that are later to be
employed, or even to p


lytical procedures 
conduded. Oneo 
analysis of varianc 
sign is inseparable 
it difficult to igno 
actually initiated 


t been restricted to techniques which 
A. Fisher.


In particular, an effort has been made to bring the student up to
date on the logic of statistical inference and to make him more
keenly conscious of the constant need for very critical
consideration of the assumptions underlying any statistical
technique he may employ.

The writer is convinced that the development of a genuine un- 
derstanding of the nature of statistical inference and a thorough 
training in the use of the methods of analysis of variance should be
considered an absolute essential in the general preparation of re- 
search students in education. It is hoped, therefore, that this book 
will prove usable as a textbook in advanced courses in educational 
statistics or in the second half of a required full year introductory 
course. If so used, it will require considerable supplementation by 
other references, since (with a few minor exceptions) no attempt 
has been made in this book to discuss any problems that have 
already been adequately treated in the standard texts. It has 
been the writer's experience, however, that adequate consideration 
of the problems here treated will require a considerable share of a 
typical three-hour course. 

It may be noted that the omission of a set of exercises for the 
student has not been accidental. The methods here considered 
have been so little used in educational research as to make impos- 
sible, for the present, the collection of a set of exercises or examples 
based on actual data in the field of education. The omission of 
lists of references or supplementary readings at the end of each 
chapter is also deliberate. Most of the general references which 
could be given would be of doubtful value to the students to whom 
this book is addressed, for the reasons already indicated in the case 
of Fisher’s books. Furthermore, many of the original papers which 
have been consulted in the preparation of this volume have ap- 
peared in journals which are not readily accessible to students in 
education. (The more important of these, however, have been 
cited in footnote references.) 

The writer has been extremely fortunate in obtaining an unu- 
sual amount of assistance in the preparation of this book. He 
is grateful, first of all, to the students in his own classes who used 
the book in its preliminary mimeographed edition and who directed 
attention to many typographical errors and to instances in which 
the lucidity could be improved. Various parts of the preliminary
manuscript were read by Dr. C. H. McCloy of the State University 
of Iowa, Dr. Marian Wilder of the University of Minnesota, Dr. 
John C. Flanagan of the Co-operative Test Service, New York 
City, Dr. Edward E. Cureton of the Alabama Polytechnic Insti- 
tute, and Dr. Jack Dunlap of the University of Rochester, all of 
whom offered many valuable suggestions. The entire manuscript 
was carefully read by Professor Allen T. Craig of the Department 
of Mathematics of the State University of Iowa. Special acknowl- 
edgments are due Professor G. W. Snedecor of Iowa State College, 
who gave very generously of his time in consultations with the 
writer in the earlier stages of the book and whose Mathematical 
Statistics contributed greatly to the writer's own understanding 
of the possibilities in Fisher's methods. 

The major acknowledgment is due Dr. W. G. Cochran, formerly 
of the Rothamsted Experimental Station, Harpenden, England, 
and now of the Statistical Laboratory of Iowa State College. Dr. 
Cochran read the entire manuscript most painstakingly and offered 
a very large number of concrete and constructive suggestions, all 
of which the writer found it desirable to observe in the final revision 
of the manuscript. 


Grateful acknowledgment is made to Professor R. A. Fisher, 
and to his publishers, Oliver and Boyd, for 
CHAPTER I
FUNDAMENTAL CONCEPTS IN SAMPLING THEORY 


1. INTRODUCTORY 
Nearly all experimental research in education, and most of that
which is not experimental in character, involves the drawing of 
inferences about a population from what is known of a sample 
taken to represent that population. Accordingly, one of the most 
important of the technical problems faced by the research worker 
is that of determining just how much may confidently be said about 
a population from what is known of a sample, or of ascertaining the 
degree of confidence which may be placed in the inferences drawn. 
Closely related to this is the equally important problem of how to 
select a sample, or how to plan an experiment, so that it may yield 
the most dependable or precise information about the population 
involved, and so that it will permit an objective and valid estimate 
of the degree of precision attained. 

In recognition of their extreme importance in the training of the 
research worker in education, a major share of this text will be de- 
voted to a detailed consideration of these problems. Particular 
attention will be given to the design of experiments, to small sample
theory, and to the testing of statistical hypotheses. While the con- 
tributions to statistical theory which have recently been made in 
these areas appear to be of revolutionary significance in educational 
research, they have thus far been available to the research worker 
in education for the most part only in the literature of agricultural 
research and mathematical statistics. The language and the set- 
ting in which they have there been presented have proved in- 
ordinately difficult for the student of education to comprehend, 
and have perhaps seriously retarded their much needed introduc- 
tion into educational research practices. It is accordingly one of 
the major purposes of this text to interpret these contributions in 
a language and notation familiar to the student of education, and 
to discuss and illustrate their possibilities with specific reference
to the types of problems and materials with which he will have to
deal.


specifically, it will be as: 
interpret the basic sta 


2. DEFINITIONS OF IMPORTANT TERMS 


ing of human beings; 
error formulas assume infinite populations, whereas the populations
to which the formulas are applied are usually finite. This assump- 
tion offers little difficulty, however, since the populations actually 
involved are usually so large that they may be considered as prac- 
tically infinite. 
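
The claim that large populations may be treated as practically infinite can be checked numerically. The sketch below uses the usual finite population correction factor, which is not given in this text and is introduced here only as an assumption for illustration; the population sizes are hypothetical.

```python
import math


def finite_population_correction(population_size, sample_size):
    """Factor by which an infinite-population standard error is multiplied
    when sampling without replacement from a finite population.
    (Standard formula, assumed here; it does not appear in the text.)"""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


# Hypothetical cases: a sample of 400 from a small and from a very large population.
small_population_factor = finite_population_correction(2_000, 400)
large_population_factor = finite_population_correction(1_000_000, 400)

print(round(small_population_factor, 4))
print(round(large_population_factor, 4))
```

When the population is very large relative to the sample, the factor is so close to 1 that ignoring it changes the standard error only negligibly.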

A real population is one that actually exists; a hypothetical pop- 
ulation is one that exists only in the imagination. Many of the pop- 
ulations involved in educational research are hypothetical. For 
example, an experiment may be conducted to determine the rela- 
tive effectiveness of two methods of instruction. For the purposes 
of the experiment, two groups of seventh-grade pupils are selected, 
one of which is taught by Method A, the other by Method B, and 
at the close of the experiment comparable measures of achieve- 
ment are secured for all pupils. In interpreting the results, the 
pupils who studied under Method B, for example, are considered 
as a sample from a population of seventh-grade pupils, all of whom 
had been taught by this method. Since the pupils in the experi- 
ment may be the only ones who have ever been taught by this 
method, this population is of course hypothetical. It is neverthe- 
less useful to recognize that the method might produce different 
results if used with other seventh-grade pupils, and that the ex- 
perimental results must therefore be considered as only a fallible 
indication of the results that would be generally attained. In 
some instances, we may wish to select a sample from a real popula- 
tion, but find it impracticable to secure an unbiased sample from 
that population. In that case we may use the sample that is 
available to us, “construct” a hypothetical population from which 
the given sample might have been drawn at random, and restrict 
our generalizations to that hypothetical population. 

A random sample is one selected in such a fashion that every 
member of the population has an equal chance to be selected. 
This means that each member must be selected independently of 
all others. It is useful also to think of a random sample as one so 
drawn that all other possible combinations of an equal number of 
members from the population had an equal chance to constitute 
the sample drawn. Suppose, for example, that we are drawing a
random sample of 300 cases from all high-school pupils in Indiana.
There is, of course, an almost unlimited number of different
combinations of 300 pupils in this population. One of these
combinations, for instance, might consist of 2 pupils from Terre Haute, 13
from Lafayette, 276 from Indianapolis, and 9 from Gary. If our 
sampling is random, this particular combination must have the 
same chance of being selected as any other. Emphasis is placed 
on this latter concept of random sampling, since it indicates quite 
clearly that the samples used in educational research are seldom 
simple random samples. In practice, accessibility or feasibility
are often determining factors in sampling. If we were actually 
drawing a sample of 300 pupils from Indiana high schools, under 
the methods usually employed the particular combination
described above would have no chance of being drawn. The only
feasible procedure would be to secure the co-operation of a few 
schools that together would provide the 300 cases needed; we 
could not expect to select each pupil independently from the whole 
population, and then use the pupils so selected regardless of how 
they were scattered throughout the state. The extreme
significance of such practical obstacles to random sampling in educational
research will be made clear in a later section. The procedure that
may be followed to draw a random sample when no such obstacles
exist is also to be explained later.
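
The contrast drawn above, between a simple random sample and a sample made up of a few co-operating schools, might be sketched as follows. The population of 40 schools with 75 pupils each, the identifiers, and the function names are all hypothetical and serve only as an illustration.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k members so that every combination of k members is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def intact_group_sample(groups, n_groups, seed=None):
    """Pick a few whole groups (schools) and take every pupil in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(groups), n_groups)
    return [member for name in chosen for member in groups[name]]


# Hypothetical population: 40 schools of 75 pupils each (3000 pupils in all).
groups = {
    f"school_{s:02d}": [f"school_{s:02d}/pupil_{p:02d}" for p in range(75)]
    for s in range(40)
}
population = [pupil for members in groups.values() for pupil in members]

srs = simple_random_sample(population, 300, seed=1)
clustered = intact_group_sample(groups, 4, seed=1)

schools_in_srs = {pupil.split("/")[0] for pupil in srs}
schools_in_clustered = {pupil.split("/")[0] for pupil in clustered}
print(len(schools_in_srs), len(schools_in_clustered))
```

Both procedures yield 300 pupils, but the second can only ever produce combinations confined to four schools, so most of the combinations that a simple random sample could produce have no chance of being drawn.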


A biased sam 
biased with reference to vocational interests, but their mean age
might be an unbiased estimate of the mean age of all sophomores 
in the university. Freedom from bias is one of the most important 
characteristics of a sample, and random sampling is one of the 
surest ways of obtaining it. 

A stratified sample is one which may be subdivided into groups, 
each of which may be considered a sample from the corresponding 
subdivision of the entire population. For example, we might sub- 
divide the entire population of adult males in the United States 
into various income groups and select a random sample (of any 
size) from each income group. The total sample thus secured
would be considered a stratified sample. Again, in selecting a sam- 
ple of schools in a given state, we might classify all schools accord- 
ing to enrollment, and select any desired number of schools from 
each enrollment classification. The numbers constituting the
subgroups in a stratified sample are arbitrarily determined, and
need not be proportional to the numbers in the corresponding
subdivisions of the population. Stratified sampling has much the
same advantages as controlled sampling, which are discussed in 
the following paragraphs. Methods of estimating the standard 
error of the mean of a stratified sample will be presented later 
(pages 157 ff.). 

A controlled sample is one in which the selection is not left to 
chance, or not entirely to chance, but in which the distribution of 
some selected characteristic is made to conform to some prede- 
termined proportion. It is a stratified sample in which the sub- 
group numbers are proportional to the corresponding numbers in 
the population. For example, we may wish to study (in a given 
population of school children) some trait, such as weight, that is 
known to be related to sex, and may wish to insure that our sam- 
ple does not by chance contain an undue proportion of either sex. 
Assuming that there are equal numbers of boys and girls in the 
entire population, we might then select a certain number of boys 
at random from all boys and the same number of girls at random 
from all girls. In other words, we would make our sample
representative with respect to sex. All samples so drawn would then
contain the same predetermined proportion of boys and girls, and 
hence would not be simple random samples, since in random sam- 
ples this proportion would fluctuate from sample to sample due to 
chance selection. Samples of this type are also frequently known 
as representative samples, although this term has no standard 
meaning. 
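
The difference between a stratified sample with arbitrary subgroup numbers and a controlled sample with proportional subgroup numbers might be sketched as follows. The enrollment classes, their sizes, and the function names are hypothetical, and the rounding used for proportional allocation is an assumption that happens to be exact for these figures.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a random sample of the requested size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}


def proportional_sizes(strata, total):
    """Subgroup numbers proportional to the strata; rounding may not sum
    exactly to `total` in general (it does for the figures below)."""
    population = sum(len(members) for members in strata.values())
    return {name: round(total * len(members) / population)
            for name, members in strata.items()}


# Hypothetical schools classified by enrollment.
strata = {
    "small": [f"small_{i}" for i in range(600)],
    "medium": [f"medium_{i}" for i in range(300)],
    "large": [f"large_{i}" for i in range(100)],
}

# Stratified sample: subgroup numbers chosen arbitrarily.
arbitrary = stratified_sample(strata, {"small": 10, "medium": 10, "large": 10}, seed=7)

# Controlled sample: subgroup numbers proportional to the population.
controlled_sizes = proportional_sizes(strata, 50)
controlled = stratified_sample(strata, controlled_sizes, seed=7)
print(controlled_sizes)
```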

The exercise of control in sampling is worth while to the degree 
that the characteristic whose distribution is controlled is related 
to the characteristic being studied. For example, there would be 
little point in “controlling” sex in a study concerned with the 
mean intelligence of a population of school children, since it is 
known that boys and girls do not differ appreciably in performance 
on general intelligence tests. A control of the chronological age
distribution, on the other hand, might markedly increase the
reliability of obtained results in


of controls, however, always 
sumed or predetermined distr: 


lled samples 
For the purpose of an in- 
ple, we might wish to select two 
samples of school children so that they will show the same distribu-
tion of intelligence. To do this, we might divide the available 
pupils into “levels” of intelligence, each level consisting of pupils 
with the same intelligence test score, or with intelligence test 
scores in the same relatively narrow interval along the scale. We 
might then assign the pupils in each level to our two samples at 
random, half to each sample. The pupils would then be ran- 
domized within each level, but the samples would not be random 
samples from any population (although the two samples might be 
considered as a randomly selected pair from an infinite number of 
samples all of which show the same distribution of intelligence). 
Controlled sampling may or may not involve random selection. 
For example, in selecting a sample of schools in a study of school 
cost-accounting practices in Iowa, one might decide in advance to 
select a certain number of schools from each enrollment classifica- 
tion within each of a number of geographical districts within the 
state. Within these restrictions the selection of schools might be 
fortuitous, or it might be from only those schools that are willing
to co-operate; that is, the selection may not be strictly random
at any point. If, however, there are any systematic differences in
accounting practices of large and small schools, or of schools in
different parts of the state, the controls may contribute
appreciably to increased reliability of the results. Controlled sampling
which does not involve random selection suffers from the very
serious disadvantage that it does not permit any objective
description of the reliability of the results obtained, but is nevertheless
often worth while, particularly where random sampling is in
any event impracticable.

The exercise of controls in sampling, particularly when the
ultimate selection is random, is an extremely useful device in
educational research, and its possibilities appear to have been seriously
neglected. Considerable attention will therefore be given to this
problem in this text (pages 157 ff.), particularly with reference
to experimental research, in which the problem is essentially the
same as that of how to design the experiment (Chapter IV).
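
The procedure described earlier in this section, in which pupils are grouped into intelligence “levels” and the pupils in each level are assigned at random, half to each of two samples, might be sketched as follows. The scores, the width of a level, and the handling of an odd pupil in a level (left unassigned) are assumptions made only for this illustration.

```python
import random
from collections import defaultdict


def assign_within_levels(scores, width=10, seed=None):
    """Split pupils into two samples by randomizing within score levels.

    scores: mapping of pupil identifier to intelligence test score.
    width:  width of each score interval treated as one "level".
    """
    rng = random.Random(seed)
    levels = defaultdict(list)
    for pupil, score in scores.items():
        levels[score // width].append(pupil)

    sample_a, sample_b = [], []
    for level in sorted(levels):
        members = levels[level]
        rng.shuffle(members)                 # random order within the level
        half = len(members) // 2
        sample_a.extend(members[:half])
        sample_b.extend(members[half:2 * half])
        # With an odd number in a level, the last pupil is left unassigned
        # (an assumption of this sketch).
    return sample_a, sample_b


# Hypothetical data: 60 pupils with scores between 85 and 124.
data_rng = random.Random(3)
scores = {f"pupil_{i:02d}": data_rng.randint(85, 124) for i in range(60)}
sample_a, sample_b = assign_within_levels(scores, width=10, seed=11)
print(len(sample_a), len(sample_b))
```

The two samples necessarily contain the same number of pupils at every level, so they show the same distribution of intelligence up to the width of a level, while the choice of which pupils go to which sample is left to chance.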

A parameter is any measure based upon an entire population.
Parameters are perhaps more widely known as “true” measures.
For example, a true mean, which is a parameter, is the mean of all
members of the population. A statistic is any measure derived
from a sample, and is frequently referred to as an “obtained”
measure, or as an “observed” measure. A parameter always
has an exact constant value, although usually unknown; a
statistic varies in value from sample to sample. Parameters are seldom
determinable, since usually the entire population is not accessible;
in most instances, the best we can do is to estimate the parameter
by drawing a sample and calculating the corresponding statistic.
For example, the best estimate of a true mean that may be derived
from a random sample is the mean of that sample. The best
estimate of a parameter is not always the corresponding statistic.
For example, if our sample is [...]; we shall see later how a better
estimate may be obtained (pages 48 ff.).


A sampling error is the difference between a parameter and an 
estimate of that parameter which j 


5i 
ical distribution of a st 
samples. The distributio 
ber of Samples, each con 
given population, would 
of a sample of 50 cases. 

a distribution of the sam; 
liability of estimates based on samples. It is known, for example,
that for most populations which will be encountered in practice 
the sampling distribution of the mean of a large random sample is 
a normal distribution, and hence if we can estimate the standard 
deviation of this sampling distribution we can also estimate the 
probability that an obtained mean will differ from the true mean 
by more than any given amount. Many students seem to have 
the notion that all sampling distributions are normal, particularly 
if the samples are simple random samples drawn from normal 
populations. This is a very serious misconception. For example, 
the sampling distribution of the standard deviation of a small ran- 
dom sample is markedly skewed positively; hence, even though we 
knew the standard error, we could not interpret it in terms of the 
normal probability integral table. It is sometimes possible to es- 
timate the standard error of a statistic even though the form of its 
sampling distribution is unknown. For example, we have long 
had a standard error formula for the product-moment correlation 
coefficient, but it is now known that the sampling distribution of r
is markedly skewed negatively when the true r is high. Many
instances may be found in the literature of educational research of
attempts to use the normal probability integral table in inter- 
preting the standard errors of r’s of large magnitude, and often with 
seriously misleading results. It is important to note, then, that 
a standard error formula is of little value unless the form of the 
sampling distribution is known to be approximately normal. 
Sampling distributions are described in terms of mathematical 
equations, or in terms of probability integral tables derived from 
these equations. The student is already familiar with the normal 
probability integral table; he will later have occasion to use similar 
tables for other forms of distributions. The derivation of these 
formulae and tables is a problem for the mathematical statistician. 
The typical research worker in education, because of his lack of 
advanced mathematical training, must be content to accept these 
formulae and tables on faith, and no attempt will be made in this 
text to show how they are derived. It is important to note,
however, that all of these derivations at some point involve the
assumption of random selection. Most sampling distributions thus
far determined are for simple random samples only, but some 
sampling distributions are known for controlled samples in which 
the last step in selection is random. Sampling distributions de- 
rived for randomly selected samples may be very misleading when 
applied to samples that are not truly random, even though they 
seem for all practical purposes to be equivalent to random samples. 
It is for this reason that so much emphasis is later placed in this 
text on the exercise of meticulous care in making random selections. 
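
The point that not every sampling distribution is normal, even for simple random samples from a normal population, can be illustrated by simulation. The sketch below draws many small samples from a hypothetical normal population (mean 100, standard deviation 15, both assumed) and compares the skewness of the sampling distribution of the mean with that of the standard deviation. The skewness measure is scale-free, so the choice of n or n - 1 in the sample standard deviation does not affect the comparison.

```python
import random
import statistics


def sampling_distribution(statistic, n, trials, seed=0, mu=100.0, sigma=15.0):
    """Values of `statistic` over repeated random samples of size n
    drawn from a normal population (parameters are hypothetical)."""
    rng = random.Random(seed)
    return [statistic([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(trials)]


def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values, mean)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


means = sampling_distribution(statistics.fmean, n=5, trials=20000)
sds = sampling_distribution(statistics.stdev, n=5, trials=20000)

print("skewness of sample means:", round(skewness(means), 3))
print("skewness of sample standard deviations:", round(skewness(sds), 3))
```

Under these assumptions the means show essentially no skewness, while the standard deviations of samples of five cases are clearly skewed positively, which is why a standard error alone cannot be interpreted through the normal probability integral table in such a case.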

The precision or reliability of an estimate is dependent on the
variability of its sampling distribution. It is significant that no
estimate is of any value whatever unless something is known about
its precision. This does not mean that we must always be able to
compute the standard error of an estimate, but it does mean that
unless we at least have an intuitive or subjective notion, based on
observation or experience, of how precise an estimate is, we might
as well not have an estimate at all. Obviously, it is much better
if the precision can be quantitatively described on an objective
basis. The advantage of objective description is so great that we
might frequently use the less precise of two available estimates,
because the exact degree of precision is known in one case but not in
the other. It is for this reason that random selection is so
important, since it is only for samples involving random selection that
objective descriptions of the precision of estimates may be
obtained.


from the sample. 


tion has certain advantages: (1) i
estimate is after all only an estimate; (2) it recognizes that it is
just as important to know how precise or dependable the estimate 
is as to have an estimate at all; and (3) it suggests more directly 
the nature of the logic by which we describe the degree of precision 
attained. 

Suppose, to take a very simple illustration, that we wish to es- 
timate the mean weight in a certain population of ten-year-old 
girls. Suppose that we have drawn a random sample of 400 girls 
from this population, and have found the mean weight for this 
sample to be 65 pounds. Our “best estimate” of the population 
mean is therefore 65 pounds. This is but another way of saying 
that the hypothesis that the true mean is 65 is the best hypothesis 
we can make with the information at hand, or that it is the hypo- 
thesis which best accounts for the fact that our sample mean is 65 
pounds. There are, however, many other hypotheses about the
true mean that might readily be defended. It is possible, for in- 
stance, that the true mean is 66 pounds, and that our obtained 
mean of 65 represents a chance deviation of one unit from this true 
mean. How tenable this hypothesis is, or with what degree of 
confidence we may accept or reject it, depends upon the relative 
frequency with which the obtained means of random samples of 
this size would deviate one unit or more from the true mean. For 
example, if obtained means deviating as much as 1 unit from the 
true mean would very frequently occur by chance, we would have 
no good reason to reject the hypothesis that the true mean is 66 
simply because a mean of 65 was found in one sample. However, 
if means deviating this much from the true mean would only very 
rarely be found, then the fact that the mean of our sample is 65 
throws serious doubt on the hypothesis that the true mean is 
66. 

This relative frequency could be determined if we knew the 
sampling distribution of the mean of a sample of 400 cases, and in 
this instance sampling theory does provide us with a means of 
describing the sampling distribution. It is known that the
sampling distribution of the mean of a large random sample is usually
normal in form, and that its standard deviation is given by the
formula

$$\sigma_M = \frac{\sigma_{\text{population}}}{\sqrt{n}} \qquad (1)$$

in which $n$ is the number of cases in the sample. The usefulness
of this formula is limited by the fact that it involves a parameter,
the $\sigma$ of the population, which of course we cannot know.
However, we can find the $\sigma$ of our sample, and since our sample is quite
large, we can derive a useful estimate of $\sigma_M$ by substituting the
$\sigma$ of the sample for the $\sigma_{\text{population}}$ in the formula.
Let us suppose that we have found the $\sigma$ of the sample to be
8 pounds. Our estimated standard error of the mean is then

$$\sigma_M = \frac{8}{\sqrt{400}} = .4$$

We now have a complete description (subject, however, to what- 
ever error is involved in the preceding estimate) of the sampling 
distribution under the hypothesis to be tested. We have assumed 
its true mean to be 66, have estimated its standard deviation to be 
.4, and know that it is a normal distribution. In this distribution,
the value 65 deviates 1/.4 = 2.5 standard deviations from the
hypothetical true mean. According to the normal probability
integral table (Appendix, Table 17) less than two per cent of
the cases in a normal distribution deviate so far from the mean.
Hence, if our hypothesis is true, something has happened in this
one sample that would occur by chance in less than two per cent
of such samples in the long run. Since it would be very
unreasonable to suppose that so rare an event has actually “come off” in
this one case, we conclude that the hypothesis itself must be false.
Consider, on the other hand, the hypothesis that the true mean is
64.5. Under this hypothesis, means deviating as much from the
true mean as does our obtained mean of 65 would be obtained in
about 22 per cent of samples of this size. In this case, we obviously
could not reject the hypothesis with any high degree of confidence.
The degree of confidence with which we may reject (or accept)
any hypothesis would then depend upon the relative frequency
with which results deviating as much from the hypothetical as 
those found in our sample would occur by chance if the hypothesis 
were true. Whether we would categorically reject or accept the 
hypothesis, that is, whether we adjudge it categorically as either 
“tenable” or “untenable,” depends upon the degree of confidence 
which we have arbitrarily decided is essential. We might, for 
example, decide to consider any hypothesis “tenable” under 
which results as divergent as those obtained in our sample would be 
found in at least 5 per cent of such samples, or we might be more 
conservative and reject only those hypotheses under which chance 
would account for the divergence of our obtained result less than 1 
per cent of the time. It will be useful to identify such arbitrarily 
defined levels of confidence in terms of per cents. For instance, 
we might term the first of the levels just defined as the “5 per cent
level,” and the second as the “1 per cent level.” In general, to
say that we reject an hypothesis at the “x per cent level of
confidence” is to say that the absolute divergence of our observed
result from the hypothetical true result would be exceeded in less
than x per cent of such samples if the hypothesis were true.
Whatever level of confidence we have decided to employ as a minimum,
we may, by testing successive hypotheses, find what limiting values 
of the true mean constitute “tenable” hypotheses at that level. 
We could then say that we are confident (at the given level) that 
the true mean lies between these limits. For example, in the case 
of our illustration we may be confident at the 5 per cent level 
that the true mean does not lie outside 64.22 and 65.78, and at the 
1 per cent level that it does not lie outside 63.97 and 66.03.
(According to Table 17, page 261, 5 per cent of the cases in a normal
distribution will deviate more than 1.960 $\sigma$ from the mean. Hence
the limiting values of the true mean are $65 \pm 1.960 \times .4 =
65 \pm .784$, that is, 64.22 and 65.78.)
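
The arithmetic of the illustration can be retraced in a few lines. The sketch below uses Python's `statistics.NormalDist` in place of the normal probability integral table; the function names are hypothetical, and the figures (sample mean 65, sample standard deviation 8, 400 cases) are those of the example above.

```python
import math
from statistics import NormalDist


def two_tailed_probability(sample_mean, hypothetical_mean, standard_error):
    """Proportion of samples whose means would deviate from the hypothetical
    true mean at least as much as the obtained mean does."""
    z = abs(sample_mean - hypothetical_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(z))


def tenable_limits(sample_mean, standard_error, level):
    """Limiting values of the true mean that are tenable at the given level
    (0.05 for the 5 per cent level, 0.01 for the 1 per cent level)."""
    z = NormalDist().inv_cdf(1 - level / 2)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


standard_error = 8 / math.sqrt(400)                        # .4

p_66 = two_tailed_probability(65, 66, standard_error)      # under 2 per cent
p_64_5 = two_tailed_probability(65, 64.5, standard_error)  # roughly a fifth of samples

low_5, high_5 = tenable_limits(65, standard_error, 0.05)
low_1, high_1 = tenable_limits(65, standard_error, 0.01)

print(round(p_66, 4), round(p_64_5, 3))
print(round(low_5, 2), round(high_5, 2))
print(round(low_1, 2), round(high_1, 2))
```

The first probability falls below two per cent, and the second comes out near 0.21, in line with the “about 22 per cent” quoted above. The limits reproduce 64.22 and 65.78 at the 5 per cent level and 63.97 and 66.03 at the 1 per cent level.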

Since the preceding interpretation may differ somewhat from
that to which the student is accustomed, it may be well to draw
attention to some of the more important aspects of the logic
involved. Many students have learned to interpret results like
those just given by constructing a normal distribution with a 
mean of 65 and a standard deviation of .4, finding the per cent of
the distribution which lies between any two selected points in this 
distribution, and then stating that the “chances” are such and 
such that the true mean lies between these values. He might then 
say, for instance, that “the chances are 95 in 100 that the true
mean lies between 64.22 and 65.78,” or that “the probability is
less than 1 in 100 that the true mean lies outside of the values
63.97 and 66.03.” For most practical purposes, the end result is 
the same as if the “level of confidence” type of interpretation is 
employed, but the reasoning involved is based on a questionable 
assumption. To draw a normal curve with a mean at 65 and to 
determine the “probability” that the true mean lies between any 
points along the curve is to reason as if there are many values of 
the /rue mean, and that these true means are normally distributed 
about the mean obtained for the sample. Statements of probabil- 


vent only before it occurs, and then 
Knowing that the true mean is 66 we 


can make the statement wit 
unknown true mean hasa 
We may now summariz 


has merely to re- 
The third step 
tables, in what 
per cent of samples the obtained measures will deviate from the
true measure as much as or more than the measure obtained in 
the sample at hand deviates from the hypothetical true measure. 
The fourth step is to reject the hypothesis, or not, depending upon 
the “level of confidence” which has been arbitrarily determined 
in advance. 
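
The third and fourth steps can be sketched as a short computation. This is an illustrative sketch only: it assumes a normal sampling distribution, and it combines the observed mean of 65 and standard error of .4 from the earlier illustration with a hypothetical true mean of 66. The function names are hypothetical.

```python
from statistics import NormalDist


def per_cent_exceeding(observed, hypothetical, std_error):
    """Per cent of samples that would deviate from the true value by at
    least as much as the observed value deviates from the hypothetical one."""
    ratio = abs(observed - hypothetical) / std_error
    # Both tails are counted, since the divergence is taken without sign.
    return 200 * (1 - NormalDist().cdf(ratio))


def reject(observed, hypothetical, std_error, level_per_cent):
    """Reject when the divergence is rarer than the level fixed in advance."""
    return per_cent_exceeding(observed, hypothetical, std_error) < level_per_cent


p = per_cent_exceeding(65, 66, 0.4)
print(round(p, 2))               # per cent of samples with a divergence this large
print(reject(65, 66, 0.4, 5))    # decision at the 5 per cent level
print(reject(65, 66, 0.4, 1))    # decision at the 1 per cent level
```

Under these assumptions the hypothesis would be rejected at the 5 per cent level but not at the 1 per cent level, which shows why the level must be chosen before the data are examined.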


4. TESTS OF SIGNIFICANCE: THE NULL HYPOTHESIS 

In many sampling studies the interest is not so much in the limits 
within which a parameter may confidently be said to lie as in the 
single possibility that the parameter is zero. For example, we may 
ask, “Is there any correlation between these two traits?” or “Is 
there any difference in the effectiveness of these methods of in- 
struction?” or “Are girls of a given age equal in intelligence to 
boys of the same age?” In such cases we may wish to test the 
hypothesis that the true correlation is zero, or that the true dif- 
ference is zero, but may not be particularly concerned with the de- 
gree of correlation or with the magnitude of the difference if any 
does exist. Such hypotheses (that the parameter is zero) are
known as null hypotheses.* If a statistic is such that the null
hypothesis may be rejected with confidence, we say that the
statistic is significant, meaning that it signifies that the
parameter value is not zero. For example, we may select two random
samples of pupils, teach one by one method and one by another, and
find at the close of the experiment that the difference in final
mean achievement is larger than could reasonably be attributed to
fluctuations in random sampling, i.e., too large to permit us to
accept the null hypothesis. We may then say that the observed
difference in mean achievement is significant. It is important to
note, however, that to prove the difference significant does not
establish the cause of the difference. In rejecting the null
hypothesis we have only rejected one possible cause: chance
fluctuation due to random selection. What really accounts for the
difference (whether it is a real difference in effectiveness of the
methods, or some extraneous factor which was not adequately
controlled in the experiment) is quite another matter.

* The term “null hypothesis” is used by Fisher (Design of
Experiments, p. 18) to denote any exact hypothesis that we may be
interested in disproving, not merely the hypothesis that a certain
parameter is zero.

It is convenient to speak of levels of significance, just as we spoke
of levels of confidence in the preceding section. When we say that
a statistic is significant “at the 5 per cent level,” we mean that the
observed divergence from zero would be exceeded in less than 5
per cent of similar samples if the null hypothesis were true, or that
we may be confident, at the 5 per cent level, that the null
hypothesis is false. The levels of significance most frequently
employed are the 5 per cent and 1 per cent levels, and some tables
are constructed for these levels only. It has been customary in
educational research to declare a statistic significant if it is
three or more times as large as its standard error. This is not
satisfactory as a general practice, since it is limited to the case
where the sampling distribution is normal. It is also too rigid a
test for most purposes, since to require the “significance ratio” to
exceed 3 is equivalent to requiring that the statistic be significant
at the 0.26


al sampling distribution). If the 


» Of claiming significance 
The farther apart we 
€ hypotheses, i.e., the higher the level of 
the greater is the danger that we will in- 
15 among the "acceptable? hypotheses. 
The more we narrow the range of acceptable hypotheses, i.e., the
lower the level of significance we employ, the greater is the danger
of rejecting a true hypothesis, or of claiming a real difference when
no real difference exists. In general, the latter danger is the more
serious in educational research. If we find in a methods experiment
that the difference is not significant, we have in effect declared
the experiment inconclusive. That is, we recognize that there is a
possibility of a real difference, even though none has been proven,
and in effect invite further experimentation, with more precise
procedures, to show that the difference is there. However, if we
find a significant difference, the experiment is usually considered
as conclusive. That is, the effect is to discourage further
experimentation for purposes of verification only, and any further
experimentation in the same area is likely to be based on the
premise which has presumably already been established, but which may
be false. If our tests are too exacting, we may needlessly delay
experimentation along new lines while waiting further verification
of our tentative conclusions; if our tests are not sufficiently
exacting, we may follow too many false leads. Generally, we prefer
to take the former risk, and consequently demand that our
differences be significant at a high level before attempting any
generalizations.
However, we need not always employ the same high level. For
example, if we are about to recommend a new method of instruction
for a school system, and if the recommended change will prove very
expensive and involve serious administrative difficulties, we would
want to be very sure that we are not recommending a method which is
inferior or only equal to the old. However, if the change could be
very easily and cheaply made, we might be more concerned with the
danger of rejecting a method which is superior to the old, and might
make the recommendation for a change even though not highly
confident that we are right. It is important that the research
worker recognize clearly what is involved in the choice of a
critical level of significance, and that he weigh carefully the
possible consequences of each type of error in making that choice.
5. LIMITATIONS OF STANDARD ERROR FORMULAS CONTAINING
POPULATION PARAMETERS: NEED FOR A SPECIAL SMALL SAMPLE
THEORY

It was noted on page 12 that the formula for the standard error
of the mean is based on an independent population parameter, the
σ of the population. Since this parameter is usually unknown,
we are unable to compute the true $\sigma_M$, but are able to secure
an estimate of it by substituting for the parameter in the formula
the corresponding statistic for our sample. Many, if not most,
of the standard error formulas already familiar to the student
are of this type.


rge the statistic so closely approximates the param- 
à negligible error is introduced. When the sample 


à given value, M m 
the observed mean di 
divided this differe 


number equal to this ratio. We then read from the table the per
cent of cases in a normal distribution that deviate from the mean
by more than this number of σ units. Upon reflection, the student
will recognize that this procedure assumes that the ratio
$(M_o - M_T)/\text{est'd } \sigma_M$
approximate the true σ, the ratios $(M_o - M_T)/\text{est'd } \sigma_M$
are very nearly equal to the corresponding ratios
$(M_o - M_T)/\text{true } \sigma_M$; and we know that the latter
ratios are normally distributed with unit standard deviation.
Hence, for large samples the aforementioned assumptions are very
nearly satisfied. For small samples, however, the ratios
$(M_o - M_T)/\text{est'd } \sigma_M$ are not normally distributed, nor
is the standard deviation of their sampling distribution equal to
unity. In any small sample the observed mean may be relatively
large (or small) at the same time that the observed standard
deviation is relatively small (or large) as the result of chance
fluctuations. In such cases, of course, the estimated $\sigma_M$
would be smaller (or larger) than the true $\sigma_M$. We would
therefore more frequently find large (either positive or negative)
values of $(M_o - M_T)/\text{est'd } \sigma_M$ than of
$(M_o - M_T)/\text{true } \sigma_M$; and as a result the standard
deviation of the ratios $(M_o - M_T)/\text{est'd } \sigma_M$ would be
larger than 1.00 (how much larger * depending on the size of the
samples) and the distribution would be more peaked than the normal
distribution. All of this means, of course, that the use of the
normal probability integral table to interpret this ratio is not
justified if the sample is very small, and if so used may lead to
serious misinterpretations.
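
The behavior of these ratios in small samples can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original exposition: it draws samples from a normal population with true mean 66 and true standard deviation 8, estimates the standard error of the mean as the sample standard deviation (divisor n) divided by the square root of n − 1 (an assumed form of the estimate), and reports the standard deviation of the resulting ratios. The function name `ratio_sd`, the number of trials, and the seed are all hypothetical choices.

```python
import random
from statistics import fmean, pstdev


def ratio_sd(n, true_mean=66.0, true_sd=8.0, trials=2000, seed=1):
    """Standard deviation of (M_o - M_T) / est'd sigma_M over many samples of size n."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    ratios = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        # Estimated standard error of the mean: sample S.D. (divisor n)
        # divided by sqrt(n - 1). This form is an assumption of the sketch.
        est_se = pstdev(sample) / (n - 1) ** 0.5
        ratios.append((fmean(sample) - true_mean) / est_se)
    return pstdev(ratios)


sd_small = ratio_sd(5)     # very small samples
sd_large = ratio_sd(100)   # moderately large samples
print(sd_small, sd_large)
```

With samples of 5 the standard deviation of the ratios comes out well above 1.00, while with samples of 100 it is close to 1.00, which is the contrast the paragraph above describes.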
It may be noted finally that the standard deviation of a small
sample is not only highly unstable, but also tends to be
systematically smaller than the true standard deviation. We shall
see later (page 48) that it is possible to make a better estimate of
the true σ from a small sample, but even the best estimate is still
highly unstable, and what was said in the preceding paragraph
would still be true even if the best available estimate of the
population σ were used.


+ * 5 
* The standard deviation of the ratios is equal to V n—2 
We have seen that the objections to the use with small samples
of the formula for the standard error of the mean arose from the
fact that to test an hypothesis about one parameter (the true
mean) we are compelled to estimate another parameter (the true
standard deviation). In the illustration considered in the
preceding section, we were in effect testing the hypothesis that the
true mean was 66 on the further hypothesis that the true standard
deviation was 8. If the sample had been very small, the latter
hypothesis would have been very shaky indeed. Objections similar
to those just noted apply with equal force to any other standard
error formulas or tests of significance based upon a population
parameter other than that with which the hypothesis to be tested
is concerned. To deal satisfactoril


test of significance must re 
the statistic available fro 
only parameter needed to 
tion exactly must be that 
is directly concerned. 

A great deal of effort 
the formulation of statis 
there are now available 
of the needs of educatio 


be Presented later in this text — the purpose of the preceding dis- 


monstrate the need for them. It is 


misconception of the nature of our samples, as the following section
will attempt to show. Most of our samples, regardless of the
number of pupils or observations involved, are “small” samples,
and the techniques that we have generally employed in the past
are definitely inappropriate and have often been very seriously
misleading.


6. THE PROBLEM OF SAMPLING IN EDUCATIONAL RESEARCH 

We have already noted that many of the populations with which
we are concerned in educational research are such that it is highly
impracticable to draw truly random samples from them. Suppose,
for example, that we wished to draw a sample of 500 pupils from
all pupils in the eighth grades of Iowa public schools. As was
noted before in a similar illustration, if we were to select a random
sample of pupils from this population we would have to give every
eighth-grade pupil in every school an equal chance to be selected.
If we were to do this, we might find that the 500 pupils finally
selected would be widely scattered over the whole state in several
hundred different schools, and would therefore, for all practical
purposes, be inaccessible for measurement, observation, or
experimentation. What we would actually do, therefore, would be
to secure the co-operation of, say, 10 schools, which together could
provide 500 eighth-grade pupils. We might be able to select the
schools at random from all schools, but usually even this would be
impracticable. In general, the best we could do would be to
prepare a list of schools which we know in advance might be willing
to co-operate in our investigation, and then select 10 schools at
random from this list. If then we have no reason to suppose that
the schools in our list differ systematically from the other schools
in the state with reference to the characteristic(s) we are
investigating, we might be justified in considering our sample of 10
schools as equivalent to a random selection from all schools in the
state. Even so, we could not consider our 500 pupils as equivalent
to a random selection from all pupils in the state.

The reason for this is that the pupils in different schools show
large systematic differences in almost any trait that may be the
subject of a research investigation. The pupils in one school may
have had the advantage of a long succession of superior teachers
in the preceding grades, while those in another may have had
consistently incompetent teachers under poor supervision. The
pupils in one school may come from a high-class residential section
of the community, a section made up of professional and successful
business men, while those in another may have come from an
impoverished and underprivileged section made up largely of
illiterate day laborers of recent foreign extraction,


vidual pupil a 


haps already familiar with much of the almost 
of evidence o 


The student is per- 
overwhelming mass 
th while to consider 
The State Univer- 
y conducts a state-wide end-of-the-year 
Program which involves the administration of 
50,000 pupils in sev- 
Tn the 193 5 Program, an objective test 
Г Correctness was administered to all 
ninth-grade Pupils in 274 Schools. 


h 90 ninth-grade pupils, the total distri- 
bution of pupil scores is given in T. 


hools, we should, according to sampling 
rd deviation of th 


$31.54/\sqrt{100} = 3.15$, 31.54 being the standard deviation of the
total pupil distribution, and 100 the minimum number


Actually, however, we see that the standard deviation of school
means is 13.29, or more than 4 times as large as would be expected
on the hypothesis of random sampling.

TABLE I
DISTRIBUTIONS OF INDIVIDUAL AND MEAN SCORES OF NINTH-GRADE
PUPILS ON THE 1935 IOWA EVERY-PUPIL TEST IN ENGLISH
CORRECTNESS FOR 24 SCHOOLS

(Each school tested over 100 ninth-grade pupils)

       Pupil Scores                 School Means
  Scores      Frequency        Means          Frequency

  170-179                      91.0-94.49         1
  160-169                      87.5-90.99
  150-159                      84.0-87.49
  140-149                      80.5-83.99
  130-139                      77.0-80.49
  120-129                      73.5-76.99
  110-119                      70.0-73.49
  100-109                      66.5-69.99
   90- 99                      63.0-66.49
                               59.5-62.99
                               56.0-59.49
                               52.5-55.99
                               49.0-52.49
                               45.5-48.99
                               42.0-45.49
                               38.5-41.99
                               35.0-38.49
                               31.5-34.99
                               28.0-31.49

  N                                               24
  M
  S.D.        31.54                               13.29
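
The comparison just made can be verified with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (a pupil standard deviation of 31.54, at least 100 pupils per school, and an observed standard deviation of school means of 13.29); the function name is hypothetical.

```python
from math import sqrt


def expected_sd_of_means(pupil_sd, pupils_per_school):
    """S.D. of school means expected if each school were a random sample of pupils."""
    return pupil_sd / sqrt(pupils_per_school)


expected = expected_sd_of_means(31.54, 100)
observed = 13.29
ratio = observed / expected
print(round(expected, 2), round(ratio, 2))
```

The observed value is somewhat more than four times the value expected under random sampling, in agreement with the statement above.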

In consideration of these much-larger-than-chance differences
between schools, let us consider further our illustrative sample of
500 Iowa eighth-grade pupils. It is very obvious that had all of
these pupils come from a single school, the sample would represent
a very poor basis for generalization about the population,
particularly in contrast to a truly random sample of equal size.
In the random sample, hundreds of different schools would be
represented; in the sample just considered only one is represented,
and that might be one of the schools in which the level of
achievement is very high, or it might be one in which the level is
very low. It should be equally evident that a sample in which
only 10 schools are represented is neither as good as nor equivalent
to a random sample of 500 cases from the population at large.
With so few schools involved, the danger is appreciable that we
might by chance have selected 10 good schools, or 10 poor ones, or
that most of the schools used are good schools or poor ones. If,
then, we were to use the standard-error-of-the-mean formula with
this sample, substituting 500 for n, we would very seriously
exaggerate the reliability of the mean. In spite of the fact that it
contains 500 pupils, this sample must be considered as a very
“small” sample (a sample of only ten schools), and in order to
evaluate any estimate derived from it we must have a sampling
theory appropriate for small samples.

In general, then, many of the samples employed in educational
research consist of a small number of intact groups (such as classes
in the same or different schools, groups of pupils in separate
buildings in the same system, or groups of pupils from different
communities or geographical regions), or of a small number of
subsamples selected from different “

of intact groups or subsamples of which the total sample is
constituted. In other words, the unit of sampling in educational
research is often the class, the school, or the community, rather
than the pupil. It is for this reason that the need is so great in
educational research for a special small sample theory, and that
this text is in so large part devoted to an exposition of this
theory.


THE TECHNIQUE OF RANDOM SELECTION 25 


samples, but that they may involve random selection, and that it 
is therefore sometimes possible to deduce the sampling distribu- 
tions for estimates obtained from such samples. Simple random 
sampling is often impracticable in educational research, but it is 
nearly always possible to plan our investigations and experiments 
so as to provide for random selection, and thus to utilize sampling 
theory in interpreting our results. Since those interpretations will 
be valid only to the degree that the selection was actually random, 
it is obviously important that the student be provided with a tech- 
nique that will insure random selection, in so far as that is possible. 

In many of the instances in which random selection is necessary
in educational research, the selection is made from a relatively 
small number of cases. This is particularly true in experimental 
work. For example, in each of the schools involved in a methods 
experiment, we may wish to divide the seventh-grade pupils at ran- 
dom into two or more equal groups to be taught by different meth- 
ods, or we may wish to assign the classes (as already organized) at 
random to the different methods. Again, we may divide the 
available pupils into levels of intelligence, and within each level 
assign the pupils at random to the experimental treatments. 

One method of making random selections in situations of this
kind may be described as the “lottery” method. For example, if
we wished to split a group of 30 pupils at random into 3 groups of
10 each, we could prepare 30 cards or slips of paper on each of
which is written the name of one pupil, shuffle or mix these cards
very thoroughly, and then “deal” or draw blindly 3 sets of 10
cards each. This is a troublesome procedure, however, and
introduces the danger of bias through improper mixing or drawing of
the cards.
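
For a reader working with a computer rather than cards, the lottery method can be sketched as follows. This is a modern illustration, not a procedure from the original text; it assumes Python's standard pseudo-random generator is an acceptable substitute for thorough shuffling, and the function name `lottery_split` and the pupil labels are hypothetical.

```python
import random


def lottery_split(names, groups, rng=None):
    """Shuffle the 'cards' and deal them into equal groups."""
    cards = list(names)
    if len(cards) % groups != 0:
        raise ValueError("the cards cannot be dealt into equal groups")
    rng = rng or random.Random()
    rng.shuffle(cards)                      # the thorough mixing of the cards
    size = len(cards) // groups
    # Deal consecutive runs of the shuffled cards, one run per group.
    return [cards[i * size:(i + 1) * size] for i in range(groups)]


pupils = [f"pupil_{i:02d}" for i in range(1, 31)]
dealt = lottery_split(pupils, 3, random.Random(1940))
print([len(group) for group in dealt])
```

Every pupil appears in exactly one of the three groups, and each group contains 10 pupils.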

A more certain and more convenient procedure is to make use of
a table of “random numbers.” For the convenience of the student,
a part of one such table is reproduced in the Appendix (Table 18).
The manner in which the original table was constructed is described
on page 18 of Statistical Tables for Biological, Agricultural and
Medical Research, by R. A. Fisher and F. Yates (Oliver & Boyd,
London, 1938). It is sufficient to say here that the digits in this
table were so selected that any digit from 0 to 9 had an equal
chance to appear in any given position in the table. The manner in
which this table may be used should be made clear by the following
illustrations.


Illustration No. 1: To assign 5 classes at random, one to each of 5
experimental treatments.

Number the classes and the treatments separately from 1 to 5
in any order whatever. Select any point at haphazard in the table
of random numbers. Reading in any direction from this point
(right to left, bottom to top, diagonally, etc.) read the first five
unlike two-digit numbers (skipping any that may previously have
been read) from the table. Assign the first of these numbers to
class 1, the second to class 2, etc. The class with the highest
random number will then be assigned to treatment 1, that with the
second highest to treatment 2, etc.

Suppose, for example, that the first number selected haphazardly
is that in the 14th row and the 4th double column on the first
page of Table 18. Reading to the right from this point, the
first five unlike two-digit numbers are 19, 95, 50, 92 and 26. The
second class would therefore be assigned to treatment 1, the fourth
to treatment 2, the third to treatment 3, etc.
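
The ranking rule of this illustration can be written out as a short function. The sketch below is illustrative only; the function name `assign_by_rank` is hypothetical, and the five numbers are those read from the table in the example above.

```python
def assign_by_rank(random_numbers):
    """Map class number -> treatment number.

    The i-th random number belongs to class i; the class holding the
    highest number receives treatment 1, the next highest treatment 2, etc.
    """
    if len(set(random_numbers)) != len(random_numbers):
        raise ValueError("the random numbers must all be unlike")
    classes = range(1, len(random_numbers) + 1)
    # Order the classes from highest random number to lowest.
    ranked = sorted(classes, key=lambda c: random_numbers[c - 1], reverse=True)
    return {cls: treatment for treatment, cls in enumerate(ranked, start=1)}


assignment = assign_by_rank([19, 95, 50, 92, 26])
print(assignment)
```

Class 2 receives treatment 1, class 4 treatment 2, and class 3 treatment 3, as stated in the text; classes 5 and 1 then receive treatments 4 and 5.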


The “starting point” in the table should be determined before
looking at any number in the table. In the preceding case, for
instance, the decision to begin with “the 14th row in the 4th
column on the first page of the table” should be made before looking
in the table. Otherwise one might, without being fully conscious
of the fact, begin with a large number, and thus in effect
deliberately insure that class 1 will receive treatment 1, or
otherwise bias the selection. Furthermore, once having selected the
starting point and direction, no peculiarity in the numbers read
should be permitted to cause one to discard the results and start
anew at another point.
Illustration No. 2: To select 20 pupils at random from 62 available
pupils.

Number the pupils from 00 to 61 in any order whatever. Turn
to the table, and from any point and in any direction read the first
20 two-digit numbers that are less than 62, skipping any number
previously read. For example, beginning in the 11th row of the
5th column on the first page of Table 18 and reading downward,
the first 20 unlike numbers below 62 are 46, 12, 13, 35, 43, 53,
61, 24, 59, 06, 20, 38, 47, 14, 11, 00, 60, 23, 19, and 53. The
pupils who had previously been assigned these numbers would
then be the 20 required. If these numbers are checked off in the
original list of numbered pupils as they are read from the table,
there will be no difficulty in avoiding duplications.
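
The reading rule of this illustration (skip numbers that are too large, skip numbers already read) can be sketched as below. The function name `select_cases` is hypothetical, and the short run of two-digit numbers in the usage example is invented for the demonstration; it is not taken from Table 18.

```python
def select_cases(two_digit_numbers, population_size, sample_size):
    """Read numbers in order, keeping the first unlike ones below population_size."""
    chosen = []
    for number in two_digit_numbers:
        if number >= population_size:
            continue          # no pupil carries this number
        if number in chosen:
            continue          # already checked off in the list
        chosen.append(number)
        if len(chosen) == sample_size:
            return chosen
    raise ValueError("the numbers supplied ran out before the sample was complete")


# Hypothetical run of numbers as they might be read from a table.
reading = [46, 87, 12, 46, 99, 13, 62, 35]
picked = select_cases(reading, population_size=62, sample_size=4)
print(picked)
```

Here 87, 99, and 62 are passed over because no pupil is numbered above 61, and the second 46 is passed over as a duplicate.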

If the selection is made from more than 99 cases, we must read
three- or four-digit numbers from the table, as the case may be.
These may be secured by combining columns in the table. Suppose,
for instance, we wish to select 15 schools at random from the
418 high schools in a given state. We would first number all
schools from 000 to 417 in any order whatever. We would then
combine, say, the 7th double column (on the first page of Table
18) with the first half of the next column to the right, to secure
a “column” of three-digit numbers. Reading upwards from the
bottom of this “column,” the first 15 unlike numbers less than 418
are 044, 416, 377, 358, об, 057, 389, 325, 001, 373, 299, 278, 271,
332, and 395. The schools previously assigned these numbers
would then constitute our sample of 15. If the total number from
which the selection is to be made is a number like 160, for example,
considerable “hunting” would be required in the table to find
three-digit numbers less than this value. A more convenient
procedure in such cases is explained by the following example.

Suppose that we wished to select 5 cases at random from 120
available cases, numbered from 0 to 119 in any order whatever.
We would first observe that 120 is contained in 999 eight times, or
that 8 × 120 = 960 is the largest multiple of 120 which is
contained in 999. We would then select random numbers less than
960 from a three-digit column of the table, and divide each by 8,
dropping any remainder. The first five unlike quotients would
then be the numbers of the cases selected. Suppose, for instance,
that we begin at a point in the table in which the following
three-digit numbers appear in the order given below.


Random Numbers    Quotients
     562              70
     815             101
     982
     322              40
     057               7
     815
     723              90

The cases numbered 70, 101, 40, 7, and 90 would then constitute
our random sample of 5.
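
The division procedure just described can be sketched in a few lines. This is an illustrative sketch, assuming three-digit numbers in the range 000 to 999; the function name `select_by_division` is hypothetical, and the seven numbers in the usage example are those listed in the example above.

```python
def select_by_division(three_digit_numbers, population_size, sample_size):
    """Select cases numbered 0 .. population_size - 1 by the division method."""
    multiplier = 999 // population_size      # 8 when there are 120 cases
    ceiling = multiplier * population_size   # 960, the largest usable multiple
    chosen = []
    for number in three_digit_numbers:
        if number >= ceiling:
            continue                         # must be less than 960
        quotient = number // multiplier      # divide by 8, dropping any remainder
        if quotient in chosen:
            continue                         # quotients must be unlike
        chosen.append(quotient)
        if len(chosen) == sample_size:
            return chosen
    raise ValueError("the numbers supplied ran out before the sample was complete")


numbers_read = [562, 815, 982, 322, 57, 815, 723]
sample = select_by_division(numbers_read, population_size=120, sample_size=5)
print(sample)
```

The number 982 is passed over because it is not less than 960, and the second 815 is passed over because its quotient has already been obtained. Each case number from 0 to 119 corresponds to exactly 8 of the usable three-digit numbers, so every case keeps an equal chance of selection.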


ing tables of random numbers may be 


tions accompanying those tables x "The 
1 See the aforementioned tables 


by Fishi 
No. 15, Random Sampling Numbers, » LH. C Tipe Mini га 
methods here described, however, will be adequate for most
situations met in educational research.

It may be worth while to point out here the possible defects in a
certain method of sampling (of pupils) that has been frequently
employed in educational research: the practice of making the
selection schematically from a list in which the pupils’ names have
been arranged in alphabetical order. For example, we might take
every 6th name in an alphabetical list of 95 names to secure a
sample of 15, or if we wished to split the group into two equal
groups, we might do so by selecting alternate names from the
list. A selection of this kind should be free from bias (unless the
selection is made from only part of the list), so far as measures of
central tendency are concerned, but it may nevertheless not be the
equivalent of a random selection. This is because of the
possibility that alphabetized lists may be “stratified,” since pupils
with the same last name (or names beginning with the same letter)
may be related, or of the same nationality, or otherwise more nearly
alike than pupils selected at random. In general, particularly
since it involves so little trouble, there is no excuse for failing to
use the “random numbers” type of selection in situations of the type
described.


CHAPTER II 


THE USE OF THE χ² DISTRIBUTION IN TESTING
HYPOTHESES


1. INTRODUCTORY

IN SITUATIONS in which the members of a random sample may be
classified into mutually exclusive categories, we may sometimes
wish to know whether the observed frequencies in these categories
are consistent with some hypothesis concerning the relative
frequencies in these categories in the entire population. To take a
simple illustration, suppose we have asked each pupil in a random
sample of 90 to indicate which of three school subjects he likes best,
and that we find that 27 prefer subject A, 35 prefer B, and 28 prefer
C. This seems to indicate that subject B is liked better than
the others. However, the suggestion might be made that this is
just the sort of distribution of responses we might readily get from
a random sample of this size even though all subjects were equally
well liked in the population at large. We might therefore set up
the hypothesis that the distribution in the population is uniform,

we would expect the observed distribution for samples of 90 to
“diverge” farther from a uniform distribution than did the one
sample at hand.

The manner in which this hypothesis may be tested has already
been suggested in the preceding chapter. We must first devise
“divergence” of fact from hy-
since the only alternatives are that a very improbable event has
occurred or that our sampling was faulty, neither of which may be
acceptable.

The statistic, χ², needed to test this and a wide variety of similar
hypotheses will be defined in the next section. Its sampling
distribution is generally considered one of the most important in all
statistical theory. While its practical usefulness may not be as
great in educational as in biological and other fields of research, it
will nevertheless be worth our while to devote considerable
attention to it.


2. THE MEANING OF χ²

The statistic χ² (chi-square) may be defined as

$$\chi^2 = \sum \frac{(f_o - f_t)^2}{f_t} \qquad (2)$$

in which $f_o$ represents the observed frequency in a single category,
$f_t$ the corresponding theoretical or hypothetical frequency, and in
which the $\sum$ indicates that the terms $(f_o - f_t)^2/f_t$ are to be
summed for all categories. The manner in which χ² is computed may be
illustrated with the example already given. The observed
frequencies ($f_o$) in the preference categories A, B, and C are 27,
35, and 28 respectively. The corresponding theoretical frequencies
($f_t$) are those that would have been found had the facts for our
sample corresponded exactly with our hypothesis of uniform
distribution in the population. Hence the theoretical frequencies
are 30, 30, and 30. The following tabular arrangement indicates the
steps in the computation of χ².

Preference
Category     f_o    f_t    (f_o − f_t)    (f_o − f_t)²    (f_o − f_t)²/f_t

   A          27     30        −3              9               .300
   B          35     30         5             25               .833
   C          28     30        −2              4               .133
                                                           1.266 = χ²
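
The computation defined by formula (2) can be sketched as a short function. The sketch below is illustrative; the function name `chi_square` is hypothetical, and the frequencies are those of the preference example.

```python
def chi_square(observed, theoretical):
    """Sum of (f_o - f_t)^2 / f_t over all categories, as in formula (2)."""
    if len(observed) != len(theoretical):
        raise ValueError("each observed frequency needs a theoretical frequency")
    return sum((f_o - f_t) ** 2 / f_t for f_o, f_t in zip(observed, theoretical))


observed = [27, 35, 28]          # preferences for A, B, and C
theoretical = [30, 30, 30]       # uniform distribution of 90 pupils
value = chi_square(observed, theoretical)
print(round(value, 3))
```

Carried to full precision the sum is 38/30, or about 1.267; the figure of 1.266 in the tabular arrangement arises from adding the three terms after each has been cut to three decimals.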
It should be immediately apparent that χ² is an index of the
divergence of fact from hypothesis. If each of the observed
frequencies agreed exactly with the corresponding theoretical
frequency, χ² would be zero. The greater the divergence of the
individual observed frequencies from the theoretical, the greater the
value of χ². It should be noted, however, that χ², being based on
the squares of the deviations $(f_o - f_t)$, does not take the
direction of the deviations into account. This is a limitation of χ²
which we shall consider later.

It should also be apparent that χ² may be used as a measure of
divergence from any other hypothesis that we may wish to set up.
For example, we may set up the hypothesis that in the whole
population the ratio of preferences for A, B, and C is as 8 : 5 : 5.
Under this hypothesis, the theoretical frequencies would be 40, 25,
and 25, and we could compute χ² with reference to these theoretical
frequencies just as we did with reference to the frequencies 30, 30,
and 30 which were consistent with our first hypothesis. We may then
use this procedure to measure the divergence in our sample from
any set of theoretical frequencies that we may wish to write down,
noting, however, that in this example these theoretical frequencies
must add up to 90. This means that, in setting up an hypothesis in
this instance, we may assign any values we please to two of the
theoretical frequencies, but the third will of course be completely
fixed by the first two selected. For example, if we assign the
values 10 and 52 (selected at will) to two of the frequencies, the
third must be 28 if their sum is to be 90. We therefore say that
there are only two degrees of freedom in this table; this concept,
however, will be more adequately defined later.
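
The second hypothesis and the restriction on the theoretical frequencies can both be illustrated briefly. The sketch below is an illustration under the figures given in the text (90 pupils, observed frequencies 27, 35, and 28, and a hypothetical ratio of 8 : 5 : 5); the function names are hypothetical.

```python
def theoretical_from_ratio(ratio, total):
    """Theoretical frequencies that stand in the given ratio and add up to total."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


def chi_square(observed, theoretical):
    """Sum of (f_o - f_t)^2 / f_t over all categories."""
    return sum((f_o - f_t) ** 2 / f_t for f_o, f_t in zip(observed, theoretical))


observed = [27, 35, 28]
theoretical = theoretical_from_ratio([8, 5, 5], total=90)
value = chi_square(observed, theoretical)

# Two frequencies may be chosen at will; the third is then fixed by the total.
third = 90 - 10 - 52
degrees_of_freedom = len(observed) - 1
print(theoretical, round(value, 3), third, degrees_of_freedom)
```

The sample diverges much more from the 8 : 5 : 5 hypothesis than from the uniform hypothesis, and the last two values printed restate the point about the third frequency and the two degrees of freedom.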

Let us now consider further our first hypothesis, that of a
uniform distribution. It should be obvious that even though this
hypothesis were true, we could hardly expect the observed
frequencies in any random sample to agree exactly with the
theoretical frequencies. Due to chance fluctuations in the observed
frequencies, nearly all random samples would show a value of χ² other
than zero; in some samples the value of χ² would be relatively large;
in others relatively small. To evaluate our data, then, we must
know how χ² is distributed for random samples with the given
number of degrees of freedom. Before considering this sampling
distribution, however, it will be well to consider more fully the
meaning of the concept of degrees of freedom.


3. DEGREES OF FREEDOM 

The number of degrees of freedom in a table of frequencies is the 
number of those frequencies to which we may assign arbitrary 
values and still satisfy the external requirements imposed on the 
table. If we think of each frequency as occupying a cell in the 
table, the degrees of freedom is the number of cells that may be 
filled at will. When the only restriction imposed is that the fre- 
quencies add up to a given total, the number of degrees of freedom 
is one less than the number of cells, as we have already seen in the 
illustration used in the preceding section. In some instances, how- 
ever, additional restrictions are imposed, and the number of de- 
grees of freedom correspondingly lessened. 

For example, we might have a four-celled table, like the following

    11 | 59 | 70
     9 | 38 | 47
    ---+----+----
    20 | 97 |

on which we impose the restriction that the cell frequencies in each
row and column must add up to a fixed total for that row or column.
There are many combinations of cell frequencies, other than the one
given, which will satisfy this requirement, as, for instance, the two
following:

    15 | 55 | 70          1 | 69 | 70
     5 | 42 | 47         19 | 28 | 47
    ---+----+----       ----+----+----
    20 | 97 |            20 | 97 |

However, we note that in writing in these frequencies, we could
select only one frequency at will in each table. Having decided to
write 15 in the upper left-hand cell in the first example, we had no
choice but to write 55 in the upper right-hand cell, 5 in the lower
left-hand, and 42 in the one remaining cell. In this table, then,
because of the restrictions imposed, we have only one degree of
freedom. To illustrate further, suppose that we have a table with 3
rows and 7 columns. If we impose only the restriction that the 21
cell frequencies must add up to a single fixed total, we have 20
degrees of freedom in filling the cells. If we specify that the
column totals must equal certain fixed values, we have 14 degrees of
freedom, 2 in each column. If we impose the further restriction that
the row totals must also equal fixed values, so that both row and
column totals are fixed, we have only 12 degrees of freedom.
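
The counting rules of this section can be put in a short sketch. The
function names below are hypothetical and the code is offered only as
an illustration; it covers just the three kinds of restriction
discussed above (grand total, column totals, row totals).

```python
def degrees_of_freedom(rows, cols, row_totals_fixed=False, col_totals_fixed=False):
    """Number of cells that may be filled at will in a rows x cols table.

    The grand total is always taken as fixed.
    """
    if row_totals_fixed and col_totals_fixed:
        return (rows - 1) * (cols - 1)
    if col_totals_fixed:
        return cols * (rows - 1)
    if row_totals_fixed:
        return rows * (cols - 1)
    return rows * cols - 1


def fill_2x2(upper_left, row_totals, col_totals):
    """Complete a four-celled table from its one free cell and the fixed totals."""
    upper_right = row_totals[0] - upper_left
    lower_left = col_totals[0] - upper_left
    lower_right = row_totals[1] - lower_left
    return [[upper_left, upper_right], [lower_left, lower_right]]


print(degrees_of_freedom(3, 7))                          # grand total only
print(degrees_of_freedom(3, 7, col_totals_fixed=True))   # column totals fixed
print(degrees_of_freedom(3, 7, True, True))              # both sets of totals fixed
print(fill_2x2(15, (70, 47), (20, 97)))
```

Once the upper left-hand cell is chosen, fill_2x2 leaves no further
choice, which is what is meant by one degree of freedom.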


with tables like those just described 
S chapter. There are many other 
ght be imposed on a table besides 
tals. Illustrations of some of these 
esented later. The foregoing, 
diate purposes to indicate what 


4. THE SAMPLING DISTRIBUTION OF χ²

The importance of the concept of degrees of freedom in the present
discussion lies in the fact that the form of the sampling
distribution of χ² depends only upon the number of degrees of freedom
in the table f ich i

of the sample (so long

Pearson, it is known that for any given number of degrees of freedom
(d.f.), the sampling distribution of χ² is given by
й—ї 

у = ere) > 
This equation will of course mean very little to the student not
trained in mathematics, and no attempt will be made here to show how
it was derived. However, from this equation it has been possible to
construct a table showing, for each of a number of degrees of
freedom, what value of χ² is exceeded in each of a number of
percentages of random samples. This table is presented on page 36
(Table 2). This table may be read as follows: For one degree of
freedom (first row of table) we note that in 99 per cent of all
random samples the value of χ² would exceed .000157. In 98 per cent
it would exceed .000628, in 5 per cent it would exceed 3.841, in 2
per cent 5.412, and in 1 per cent 6.635. For 6 degrees of freedom, χ²
would exceed 12.592 in 5 per cent of all random samples, etc. The
manner in which the table may be used will be made clear by the
illustrations in the following sections.
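
Entries of a table such as Table 2 can be checked numerically. The
sketch below is an illustration only and uses nothing beyond the
Python standard library; the function name chi2_upper_tail is
hypothetical, and the method (the power series for the incomplete
gamma function) is a standard one chosen for brevity, not the
derivation alluded to in the text.

```python
import math


def chi2_upper_tail(x, df):
    """Proportion of random samples in which chi-square exceeds x, for df d.f."""
    if df <= 0:
        raise ValueError("df must be positive")
    if x <= 0:
        return 1.0
    a = df / 2.0
    half_x = x / 2.0
    # Series: gamma(a, h) = h^a e^(-h) * sum_{n>=0} h^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))
    return max(0.0, 1.0 - lower)


for value, df in [(0.000157, 1), (3.841, 1), (6.635, 1), (12.592, 6)]:
    print(df, value, round(chi2_upper_tail(value, df), 4))
```

The four calls correspond to the entries quoted above for 1 and 6
degrees of freedom.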

In general, as may be noted by examining the values in the middle
column (50 per cent) of Table 2, the median value of χ² under a true
hypothesis is very nearly the same as the number of degrees of
freedom. Hence if a value of χ² less than the number of degrees of
freedom is obtained, we may conclude at once that the hypothesis is
tenable without bothering to refer to Table 2.

When the number of degrees of freedom exceeds 30, the probability of
exceeding any given value of χ² may be read from the normal
probability integral table by finding the normal deviate equivalent
of χ². For example, suppose χ² is 48.52 for 40 degrees of freedom.
The normal deviate equivalent of χ² is then

$$x = \sqrt{2\chi^2} - \sqrt{2\,\mathrm{d.f.} - 1} = \sqrt{2 \times 48.52} - \sqrt{2 \times 40 - 1} = \sqrt{97.04} - \sqrt{79} = 9.85 - 8.89 = .96.$$

The probability of exceeding the given value of χ² is then the same
as the probability that a measure selected at random from a normal
distribution will lie to the right of a point + .96σ from the mean.
(If x happened to be negative, the same interpretation applies. For
example, if x = − .62, the probability of exceeding the corresponding
value of χ² is the same as the probability that a measure selected at
random from a normal distribution will lie to the right of a point
− .62σ from the mean. The probability in this case would be about
.732.)
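
The normal deviate equivalent and the corresponding probability may
be sketched as follows. The function names are hypothetical; the
normal tail area is obtained from the complementary error function in
Python's standard math module rather than from a printed table.

```python
import math


def normal_deviate(chi_sq, df):
    """Normal deviate equivalent of chi-square, for use when df exceeds 30."""
    return math.sqrt(2.0 * chi_sq) - math.sqrt(2.0 * df - 1.0)


def normal_upper_tail(x):
    """Probability that a normal measure lies to the right of x sigma from the mean."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


x = normal_deviate(48.52, 40)
print(round(x, 2), round(normal_upper_tail(x), 3))
print(round(normal_upper_tail(-0.62), 3))
```

A negative deviate needs no special handling: the same tail function
returns a probability greater than one half.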


5. TESTS OF GOODNESS OF FIT 
One important class of tests in which χ² is employed consists of
tests of goodness of fit, meaning that they are tests of whether or
not a table of observed frequencies "fits" or is consistent with a
corresponding set of theoretical frequencies conforming to a given
hypothesis. The illustration on page 31 is of this type. In that
example, the observed value of χ² was 1.266, and the number of
degrees of freedom involved was 2. According to Table 2, for 2 d.f.,
values of χ² as large as 1.266 would be exceeded in between 70 per
cent and 50 per cent of all random samples if our hypothesis were
true. In other words, the observed value of 1.266 is just about what
we would generally expect if the hypothesis were true. This of course
does not establish that our hypothesis is true, but it certainly
leaves us with no grounds for rejecting it as an hypothesis.
Suppose, however, that for the same sample we test the hypothesis for
which the theoretical frequencies are 40, 25, and 25. In this case we
would have


    Preference
     Category     f_o    f_t   (f_o − f_t)   (f_o − f_t)²   (f_o − f_t)²/f_t

        A          27     40      − 13           169             4.225
        B          35     25        10           100             4.000
        C          28     25         3             9              .360

      Total        90     90                                8.585 = χ²


From Table 2 we see that, for 2 d.f., values of χ² as large as 8.585
would in the long run be found in less than 2 per cent of all random
samples. In other words, if our hypothesis were true, we could expect
only once or twice in a hundred to find a sample which diverged as
far from expectation as the sample we have. Hence, we must conclude,
either that the hypothesis is true and that a very improbable event
has occurred, or that the hypothesis is false (assuming that our
sample is truly random). Unless we were being very conservative,
then, we would probably feel justified in rejecting this hypothesis.
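
For 2 degrees of freedom the proportion of samples exceeding a given
χ² has a simple closed form, e^(−χ²/2), which is the same relation
used in Section 8 of this chapter. The sketch below, with a
hypothetical function name, applies it to the two values of χ² just
discussed as a check on the readings from Table 2.

```python
import math


def upper_tail_2df(chi_sq):
    """Proportion of random samples exceeding chi_sq when there are 2 d.f."""
    return math.exp(-chi_sq / 2.0)


p_uniform = upper_tail_2df(1.266)   # hypothesis of uniform preferences
p_8_5_5 = upper_tail_2df(8.585)     # hypothesis of the ratio 8 : 5 : 5

print(round(p_uniform, 3), round(p_8_5_5, 4))
```

The first proportion falls between 50 and 70 per cent and the second
below 2 per cent, in agreement with the statements made from the
table.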

A procedure similar to that already described may be used to test the
hypothesis that the frequency distribution of some continuous
variable has a given form of distribution for a given population.
Suppose, for example, that we wish to test the hypothesis that the
scores on a certain psychological examination are normally
distributed for a certain population of school children. Suppose we
have taken a random sample of 332 cases from this population, and
find that the scores are distributed as follows:

Frequency

mean and S.D. of the sample. In other words, our theoretical
distribution must be normal, and have a mean of 73.54 and an S.D. of
28.07. How these theoretical frequencies may be obtained may be
illustrated in the case of the interval 100-109. The upper limit of
this interval is 109.5, which is 1.281 S.D. from the mean; the lower
limit (99.5) is .925 S.D. from the mean. Hence, according to the
normal probability integral table,* we would expect, in a normal
distribution, to find 39.99 − 32.25 = 7.74 per cent of the cases in
this interval. Since the total number of cases is 332, the
theoretical frequency for the interval is 332 × .0774 = 25.70. The
theoretical frequencies for each of the intervals are given in the
table below. It will be noted that the three original upper intervals
have been combined into one, as well as the three original lower
intervals. This is because the χ² test should not be applied to a
table in which any theoretical frequency is very small, and we
combined these intervals to avoid such small frequencies. The steps
in the computation of χ² are indicated in the table below.
    Interval           (f_o − f_t)²/f_t

    130 and above           .2306
    120-129                 .0696
    110-119                 .8049
    100-109
    90-99                  1.7034
    80-89
    70-79
    60-69
    50-59                  1.5669
    40-49
    30-39
    29 and below
                          14.8573 = χ²


* A table appropriate to this purpose may be found in any
introductory text in statistics. The table (Table 17) given in the
Appendix to this book is not designed for this purpose, since it is
to be entered with the probability to find the corresponding
deviation, and also because the probability is derived from the sum
of the areas in corresponding segments of both tails of the
distribution.

Now, before we may evaluate the observed χ², we must determine the
number of degrees of freedom. If the theoretical frequencies had been
determined independently of the observed frequencies, subject only to
the restriction that their sum be 332, we would have 11 degrees of
freedom, one less than the number of intervals. In this case,
however, we have imposed the additional restrictions that the mean of
the theoretical distribution be 73.54 and that its S.D. be 28.07. In
such instances, we follow the rule that the number of degrees of
freedom is reduced by one for each constant that has been derived
from the observed frequencies. Since two constants (the M and S.D.)
have been derived in this case, the number of degrees of freedom is
11 − 2, or 9.
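
The theoretical frequency for a single interval, and the count of
degrees of freedom, may be sketched as below. The function names are
hypothetical, the normal area is computed from the error function
rather than read from a table, and the limits of the interval 100-109
are taken as 99.5 and 109.5, the values consistent with the
deviations of .925 and 1.281 S.D. quoted above.

```python
import math


def normal_cdf(z):
    """Area of the unit normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def theoretical_frequency(lower, upper, mean, sd, n_cases):
    """Cases expected between two score limits in a normal distribution."""
    z_lower = (lower - mean) / sd
    z_upper = (upper - mean) / sd
    return n_cases * (normal_cdf(z_upper) - normal_cdf(z_lower))


def goodness_of_fit_df(n_intervals, n_fitted_constants):
    """One d.f. is lost for the total and one for each constant taken from the data."""
    return (n_intervals - 1) - n_fitted_constants


freq_100_109 = theoretical_frequency(99.5, 109.5, 73.54, 28.07, 332)
df = goodness_of_fit_df(12, 2)   # 12 intervals; mean and S.D. fitted

print(round(freq_100_109, 2), df)
```

The small difference from 25.70 that may appear in the last decimal
place comes only from the rounding of the tabled areas.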

In Table 2, we see that for 9 degrees of freedom, χ² exceeds 14.86
just slightly less than 10 per cent of the time. We hardly have an
adequate basis, therefore, for rejecting the hypothesis of normality,
although the results certainly do not give us added confidence in
that hypothesis.

It has been previously noted that one limitation of χ² as an index of
divergence is that it does not take into consideration the signs of
the individual deviations. This limitation is particularly serious in
a test of the type just illustrated. In the preceding table, the
signs in parentheses following the theoretical frequencies indicate
the direction of the differences between the observed and theoretical
frequencies. We note that there is a strong tendency for the outlying
deviations to be positive and for those near the middle to be
negative. In other words, the observed distribution shows a marked
tendency to be "flatter" or less peaked than a normal distribution.
This pattern of signs, which is ignored in the χ² test, constitutes
strong evidence that the sample was not drawn from a normal
population, and it is probable that if we had applied some more
efficient test* of goodness of fit that takes these signs into
consideration, we could quite confidently have rejected the
hypothesis of normality. (Several actual instances in which the χ²
test proved unsatisfactory for this reason are presented in Table 10,
page 138.)

* See R. A. Fisher, Statistical Methods, Chapter III, Section 14.


6. TESTS OF INDEPENDENCE 
Another important class of tests involving χ² consists of tests of
independence, meaning that they are tests of the hypothesis that two
variables or attributes are unrelated. The simplest case is that in
which each classification is dichotomous, or in which there are only
two categories for each variable. For example, we might wish to know
if there is any significant difference in the performance of boys and
girls on certain items in a general science examination. Suppose
that, in a random sample of 150 pupils, consisting of 80 boys and 70
girls, 56 of the boys respond correctly to a given item, while only
34 of the girls respond correctly. These facts may be arranged in a
table, as follows (the numbers in parentheses will be explained
later):

                 Right       Wrong      Total

    Boys        56 (48)     24 (32)       80
    Girls       34 (42)     36 (28)       70

    Total       90          60           150
The hypothesis we wish to test is that there is no relationship
between sex and performance on this item, or that in the population
at large equal proportions of boys and girls would respond correctly
to the item. Since we are not interested in the true proportions, but
only in whether or not the two classifications are independent, we
must make our theoretical frequencies conform to the observed
marginal totals. We note that 90/150 of the total sample made the
correct response, hence the observed frequencies would have agreed
exactly with our hypothesis if 90/150 of both the boys and girls had
made the correct response, that is, if (90/150) × 80 = 48 boys and
(90/150) × 70 = 42 girls had responded correctly. This would leave 32
boys and 28 girls in the column of incorrect responses. These
theoretical frequencies are given in parentheses in the table above.
It may be noted that the theoretical frequency in each cell is equal
to the product of the corresponding marginal totals divided by the
grand total. Only one need be thus computed, however, since the
others may be obtained by subtraction from the marginal totals. It
may be noted also that the difference (f_o − f_t) between observed
and theoretical frequencies is the same (8) for each cell. The value
of χ² is then

$$\chi^2 = \frac{8^2}{48} + \frac{8^2}{32} + \frac{8^2}{42} + \frac{8^2}{28} = 7.14.$$


We have already noted that there is only one degree of freedom in a
2 × 2 table when the restriction is imposed that the cell frequencies
in each row and column add up to fixed totals. In Table 2, for 1
d.f., we note that χ² will exceed 7.14 less than 1 per cent of the
time. Hence, in this case we can reject the hypothesis of
independence with a very high degree of confidence.
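
The remark that only one theoretical frequency need be computed, the
rest following by subtraction, can be illustrated with a short
sketch. The function name expected_2x2 is hypothetical. The
probability at the end uses the fact that, for 1 d.f., χ² is the
square of a normal deviate; this is a standard result introduced here
for the check, not a step taken in the text.

```python
import math


def expected_2x2(row_totals, col_totals):
    """Theoretical frequencies for a 2 x 2 table with fixed marginal totals."""
    grand_total = sum(row_totals)
    if grand_total != sum(col_totals):
        raise ValueError("row and column totals must share one grand total")
    # Only this cell is computed from the product of the marginal totals ...
    upper_left = row_totals[0] * col_totals[0] / grand_total
    # ... the other three follow by subtraction.
    upper_right = row_totals[0] - upper_left
    lower_left = col_totals[0] - upper_left
    lower_right = row_totals[1] - lower_left
    return [[upper_left, upper_right], [lower_left, lower_right]]


observed = [[56, 24], [34, 36]]                 # rows: boys, girls; columns: right, wrong
theoretical = expected_2x2((80, 70), (90, 60))

pairs = [(fo, ft) for obs_row, th_row in zip(observed, theoretical)
         for fo, ft in zip(obs_row, th_row)]
deviations = [abs(fo - ft) for fo, ft in pairs]
chi_sq = sum((fo - ft) ** 2 / ft for fo, ft in pairs)

# Upper-tail probability for 1 d.f.: P(|normal deviate| > sqrt(chi_sq)).
p = math.erfc(math.sqrt(chi_sq / 2.0))

print(theoretical, deviations, round(chi_sq, 2), round(p, 4))
```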

It should be clearly understood that while this test may reveal that
there is some relationship between the traits involved, it does not
indicate the degree of relationship. That is, a larger χ² in another
table (or a correspondingly lower probability that it is due to
chance) would not necessarily mean a higher relationship, but only
that we may more confidently assert that some relationship exists.

Tests of independence may similarly be applied to any contingency
table, regardless of the numbers of rows and columns. In any case,
the theoretical frequency in each cell is equal to the product of the
corresponding row and column totals divided by the grand total, and
the number of degrees of freedom is the product of one less than the
number of rows and one less than the number of columns. It might be
well to consider one illustration of this type. Suppose we wish to
know if there is any relationship between the letter grades given
pupils in high-school physics and the number of courses in
high-school mathematics previously taken by the pupils. Suppose we
classify the pupils into 3 groups as to numbers of courses: those
having had one course or less, those having had two courses, and
those who have had three or more. For a sample of 1645, the
contingency table for "grades" and "courses" might then be as
follows:


    Courses          A          B           C           D                     Total

    3 or more        36         76         171          55         12          350
                  (27.23)    (66.17)    (158.94)     (75.96)    (21.70)

    2                73        165         377         206         51          872
                  (67.85)   (164.86)    (395.98)    (189.24)    (54.07)

    1 or less        19         70         199          96         39          423
                  (32.91)    (79.97)    (192.08)     (91.80)    (26.23)

    Total           128        311         747         357        102         1645


The theoretical frequency in the upper left-hand cell, for example,
is 350 × 128/1645 = 27.23, and the remaining theoretical frequencies
are similarly computed. The value of χ² is therefore

    (36 − 27.23)²/27.23 + (76 − 66.17)²/66.17 + ...
    + (73 − 67.85)²/67.85 + ... + (19 − 32.91)²/32.91 + ...
    + (39 − 26.23)²/26.23 = 32.0589

For 8 d.f., this value of χ² is exceeded much less than 1 per cent of
the time under a true hypothesis. Hence, we may confidently reject
the hypothesis that there is no relation between the number of
mathematics courses taken by high-school students and their
subsequent grades in high-school physics.
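
The same computation for a contingency table of any size may be
sketched as follows. The function name is hypothetical. The observed
frequencies are those of the grades-and-courses illustration, with
one assumption: the entry 206 in the middle row is recovered from the
row total (872) implied by the theoretical frequencies printed for
that row.

```python
def contingency_chi_square(table):
    """Return (chi-square, d.f.) for a table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, f_observed in enumerate(row):
            # Product of the marginal totals divided by the grand total.
            f_theoretical = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (f_observed - f_theoretical) ** 2 / f_theoretical
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df


grades_by_courses = [
    [36, 76, 171, 55, 12],     # 3 or more courses
    [73, 165, 377, 206, 51],   # 2 courses (206 is the recovered entry)
    [19, 70, 199, 96, 39],     # 1 course or less
]

chi_sq, df = contingency_chi_square(grades_by_courses)
print(round(chi_sq, 2), df)
```

Because the theoretical frequencies are carried to full precision
here rather than to two decimals, the result may differ from 32.0589
in the later decimal places.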


7. TESTS OF HOMOGENEITY 
Tests of independence may sometimes be considered, from another point
of view, as tests of homogeneity. The example on page 41, for
instance, might be considered as involving a test of the hypothesis
that the two sexes are homogeneous (that is, alike) in their
responses to the given test item. The following example may be more
representative of tests of homogeneity in educational research. In a
study of the difficulty of the items in an algebra examination, it
was found that the per cent of correct responses for certain items
varied considerably from school to school. The question then arose of
whether these variations were larger than could be attributed to
chance, or were indicative of real differences in the difficulty of
the items from school to school. In other words, were the schools
fundamentally homogeneous with reference to pupil performance on
these items? For one of the items, the numbers of correct and
incorrect responses in 10 schools are presented in the table below.


                 Number      Number
                 Correct     Wrong
    School         (R)         (W)         R²/T

       1            5          15         1.2500
       2                                  9.3023
       3                                 10.8936
       4                                 14.6944
       5                                  5.4915
       6                                 18.9252
       7                                  2.8929
       8                                 15.2105
       9                                  2.8810
      10                                  3.1605

    Totals        213         373        84.7019

    p = 213/586 = .36348;  q = .63652;  pq = .23136

    χ² = (1/.23136)(84.7019 − 213 × .36348) = 31.47


The method of computing χ² which has been employed in this table is
the exact algebraic equivalent of the method described in the
preceding section, but takes less time. For data that have been
arranged in 2 columns (in this case the R and W columns) the steps in
this method of computation are as follows (the results for the
example are given in brackets under each step):

(1) For each row separately, square the first frequency and divide
    by the sum of the two frequencies.
        [For first row: 5²/(5 + 15) = R²/T = 1.2500]
(2) Add these quotients for all rows.
        [84.7019]
(3) Divide the first column total by the grand total. Call this
    result p.
        [p = 213/586 = .36348]
(4) Find q = 1 − p, and compute pq.
        [q = 1 − .36348 = .63652; pq = .23136]
(5) Multiply the first column total by p, and subtract this product
    from the sum obtained in step (2).
        [84.7019 − 213 × .36348 = 7.2805]
(6) Divide the result of step (5) by pq to get χ².
        [χ² = 7.2805/.23136 = 31.47]
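
The six steps above may be sketched in code and checked against the
cell-by-cell method of the preceding section. All function names are
hypothetical. Since only the first row (5 right, 15 wrong) and the
totals of the school data are used here, the remaining rows in the
comparison are invented solely for the check and are not data from
the study.

```python
def chi_square_from_sums(quotient_sum, first_total, grand_total):
    """Steps (3) to (6), starting from the sum of R^2/T and the two totals."""
    p = first_total / grand_total                # step (3)
    q = 1.0 - p                                  # step (4)
    remainder = quotient_sum - first_total * p   # step (5)
    return remainder / (p * q)                   # step (6)


def chi_square_two_columns(rows):
    """Short method for rows of (first column, second column) frequencies."""
    quotient_sum = sum(r * r / (r + w) for r, w in rows)   # steps (1) and (2)
    first_total = sum(r for r, _ in rows)
    grand_total = sum(r + w for r, w in rows)
    return chi_square_from_sums(quotient_sum, first_total, grand_total)


def chi_square_direct(rows):
    """Cell-by-cell method: sum of (f_o - f_t)^2 / f_t over both columns."""
    col_totals = [sum(r for r, _ in rows), sum(w for _, w in rows)]
    grand_total = sum(col_totals)
    total = 0.0
    for row in rows:
        row_total = sum(row)
        for f_observed, col_total in zip(row, col_totals):
            f_theoretical = row_total * col_total / grand_total
            total += (f_observed - f_theoretical) ** 2 / f_theoretical
    return total


# First row as in the text; the other two rows are hypothetical.
example_rows = [(5, 15), (20, 23), (12, 30)]
print(chi_square_two_columns(example_rows), chi_square_direct(example_rows))

# Final steps applied to the totals printed for the ten schools.
print(round(chi_square_from_sums(84.7019, 213, 586), 2))
```

The two methods agree to within rounding error, which is what is
meant by calling the short method the exact algebraic equivalent of
the longer one.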


If desired, the computations may be based on the second column rather
than the first, that is, "second" may be substituted for "first" in
steps (1), (3), and (5). It will be apparent from step (1) that the
computation will be more convenient if based on the column containing
the smaller frequencies. It is desirable, when using this method, to
carry the results of step (1) to at least three decimal places, and
those of steps (3) to (5) to at least five. This method of
computation may of course be applied only when one of the
classifications contains only two classes. In tests of homogeneity
involving several classes in each classification, the method of
computation described in the preceding section must be applied.

As in any test of independence, the number of degrees of freedom in
the contingency table is one less than the number of rows times one
less than the number of columns. In this case, there are 9 × 1 = 9
d.f. In Table 2 we see that, for 9 d.f., the observed value of χ²
goes far beyond the 1 per cent point (21.666). Hence we may be
practically certain that these schools are not homogeneous in
responses to this item, or that there are real differences in the
difficulty of the item from school to school.

Incidentally, the data just considered are fairly representative of
those that would be found for most items in most school examinations
intended for widespread use. It may be worth while, therefore, to
draw attention to certain important implications of these facts. We
may note, in the example, that the per cent of correct responses in
the entire sample was 36.35. Since these schools are not homogeneous,
it follows that we may not consider the pupils involved as a random
sample of pupils from any population, and may therefore not apply the
formula for the standard error of a percentage
(s = √(x(100 − x)/N)) to describe the reliability of this percentage.
It means also that, in preliminary try-outs for difficulty of items
intended for use in standardized tests, the important consideration
is not the number of pupils involved in the try-out, but the number
of schools represented.


8. COMBINING PROBABILITIES FROM INDEPENDENT 
TESTS OF SIGNIFICANCE 

Suppose that three investigators have each conducted an independent
experiment to determine the relative effectiveness of the same two
methods of instruction, and that each has drawn his samples at random
from the same population. Suppose that each has used a different
criterion test, and that it is therefore impossible to throw all
results for each method into a single distribution for the purpose of
a single combined comparison. The first investigator finds that,
under the null hypothesis, the chances are only 8.2 in 100 of getting
a chance difference as large as that he obtained; i.e. the
probability is .082. The corresponding probabilities for the second
and third investigators are .115 and .060 respectively. None has
found a "significant" difference, but all observed differences favor
the same method. Are the collective results significant of a real
difference in favor of this method?

This problem may be solved by thinking of each of these probabilities
as corresponding to a given value of χ². It is known that the sum of
a number of independent values of χ² is itself distributed as χ²,
with a number of degrees of freedom equal to the sum of those for the
separate χ²'s. It also happens that, for 2 d.f., the value of χ²
which is exceeded in any given proportion (p) of cases is equal to
− 2 log_e p. Hence we may think of any probability as corresponding
to a χ² whose value is − 2 log_e p (for 2 d.f.). We can thus compute
a χ² for each probability, add these χ²'s to secure a composite χ²,
and then evaluate this composite χ² in the ordinary manner, its d.f.
being twice the number of probabilities combined.

The rules for combining a number of probabilities are therefore as
follows (these rules are for use with the more accessible tables of
common logarithms):

(1) Find the common logarithm of each probability.

(2) Add these logarithms and change the sign of the result.

(3) Multiply this result by 4.60517 (= 2 × 2.302585) to get the
    composite χ². (The number 2.302585 is the "modulus constant"
    which transforms a common logarithm to a natural logarithm.)

(4) The number of d.f. for the composite χ² is twice the number of
    probabilities involved.

For the example already given, the computation is as follows:

       p          log₁₀ p

     .082      8.91381 − 10
     .115      9.06070 − 10
     .060      8.77815 − 10

              26.75266 − 30 = − 3.24734

    χ² = 4.60517 × 3.24734 = 14.95

For 6 d.f., this value of χ² is exceeded between 5 per cent and 2 per
cent of the time. Hence the collective results are significant at the
5 per cent level, whereas no individual probability reached that
level of significance.
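
The rules for combining probabilities may be sketched with natural
logarithms directly, which makes the modulus constant unnecessary.
The function name is hypothetical. The final probability uses the
closed form of the χ² upper tail for an even number of degrees of
freedom, a standard result that takes the place of the reference to
Table 2.

```python
import math


def combine_probabilities(probabilities):
    """Return (composite chi-square, d.f., probability of exceeding it)."""
    chi_sq = -2.0 * sum(math.log(p) for p in probabilities)
    k = len(probabilities)        # d.f. = 2k, always an even number
    half = chi_sq / 2.0
    # For 2k d.f.: upper tail = exp(-half) * sum_{i=0}^{k-1} half^i / i!
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return chi_sq, 2 * k, math.exp(-half) * series


chi_sq, df, p_combined = combine_probabilities([0.082, 0.115, 0.060])
print(round(chi_sq, 2), df, round(p_combined, 4))
```

With a single probability the function returns that probability
unchanged, which is the relation χ² = − 2 log_e p for 2 d.f. stated
above.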


CHAPTER III 
SMALL SAMPLE ERROR THEORY 


1. AN IMPROVED ESTIMATE OF THE TRUE STANDARD DEVIATION
We have noted earlier (page 19) that the standard deviations of small
random samples tend to be smaller than the standard deviation of the
population, and that this is one of the reasons why it is invalid to
use the observed standard deviation of a small sample in the formula
for the standard error of the mean. We shall now consider the proof
of this statement, and at the same time derive a formula for securing
an unbiased estimate of the true standard deviation from the data for
a single random sample.

It will be convenient in this discussion to deal with the square of
the standard deviation, rather than with the standard deviation
itself. The square of the standard deviation of any distribution is
known as the variance of the distribution. Accordingly, if d
represents the deviation of an individual measure from the mean of a
sample, and n represents the number of measures, the variance of the
sample is given by

$$\sigma^2 = \frac{\Sigma d^2}{n}. \tag{3}$$

Now suppose that we take r random samples of n cases each from a
population whose true mean is M. We shall call the first sample drawn
sample 1, and let M_1 represent its mean. We shall also let d_1
represent the deviation from M_1 of any measure in sample 1, and let
d'_1 represent the deviation of the same measure from the population
mean M. It then follows that

$$d'_1 = d_1 - (M - M_1).$$

Squaring both sides of this equality, we have

$${d'_1}^2 = d_1^2 - 2 d_1 (M - M_1) + (M - M_1)^2.$$

If we now sum such expressions for all of the n measures in sample 1,
we get

$$\Sigma {d'_1}^2 = \Sigma d_1^2 - 2 (M - M_1) \Sigma d_1 + n (M - M_1)^2,$$

which, since Σd_1 = 0, reduces, after transposition of terms and
change of signs, to

$$\Sigma d_1^2 = \Sigma {d'_1}^2 - n (M - M_1)^2.$$


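We may of course write a similar expression for each of our r
samples. These expressions may be arranged in column order for
convenience in summing, as follows:

$$\begin{aligned}
\Sigma d_1^2 &= \Sigma {d'_1}^2 - n (M - M_1)^2 \\
\Sigma d_2^2 &= \Sigma {d'_2}^2 - n (M - M_2)^2 \\
\Sigma d_3^2 &= \Sigma {d'_3}^2 - n (M - M_3)^2 \\
&\;\;\vdots \\
\Sigma d_r^2 &= \Sigma {d'_r}^2 - n (M - M_r)^2
\end{aligned} \tag{4}$$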


The sum of the left-hand terms in (4) may be written as Σ(Σd²),
meaning the sum of the sums of squared deviations from the sample
means. The sum of the first of the right-hand terms would be the
grand total of the squared deviations from the population mean for
all of the nr = N measures in all of the samples. This grand total
may be written simply as Σd'², with the understanding that the
summation is over all samples. The sum of the last terms in (4) may
similarly be written as nΣ(M − M_p)², in which the subscript p is a
general term referring to any sample, and in which the summation is
for all values of p from 1 to r. Under this notation, we may write
the sum of the expressions in (4) as

$$\Sigma(\Sigma d^2) = \Sigma d'^2 - n\,\Sigma (M - M_p)^2, \tag{5}$$

which, upon dividing through by N = nr, leads to

$$\frac{\Sigma(\Sigma d^2)}{rn} = \frac{\Sigma d'^2}{N} - \frac{n\,\Sigma (M - M_p)^2}{nr},$$

or to

$$\frac{\Sigma\left(\dfrac{\Sigma d^2}{n}\right)}{r} = \frac{\Sigma d'^2}{N} - \frac{\Sigma (M - M_p)^2}{r}. \tag{6}$$


We may now note that the left-hand term of (6) is the mean value of
(Σd²/n) for all r samples. Let us now let r (and hence N as well)
become infinitely large. The left-hand term is still the mean value
of (Σd²/n) for all samples, but the first right-hand term becomes the
true variance of the population, or σ². The second right-hand term
becomes the variance of the sampling distribution of observed means,
i.e., the square of the standard error of an observed mean, or σ_M².
But we know that

$$\sigma_M^2 = \frac{\sigma^2}{n}.$$

Hence, (6) may be written as

$$\text{Mean value of } \left(\frac{\Sigma d^2}{n}\right) = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2.$$


It is now apparent from this equation that the mean value of
(Σd²/n), that is, the mean of the sample variances, is less than the
population variance, since (n − 1)/n is of course less than unity.
This is but another way of saying that the variance of a sample tends
to be less than the variance of the population, which is what we set
out to prove. Furthermore, by multiplying through in this equation by
n/(n − 1), we get

$$\text{Mean value of } \left(\frac{\Sigma d^2}{n-1}\right) = \sigma^2.$$

This means that for a single random sample Σd²/(n − 1) is an unbiased
estimate of the population variance, since in the long run its mean
value is exactly equal to the true variance. It is on this basis,
then, that we are able to write the formula

$$\text{est'd } \sigma^2 = \frac{\Sigma d^2}{n-1}. \tag{7}$$
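
The result just proved can be verified exactly on a small case by
listing every possible sample. The sketch below is an illustration
under stated assumptions: the population (the six values 1 to 6, each
equally likely) and the sample size of 3 are hypothetical, and
sampling is with replacement so that the measures in a sample are
independent. Exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population, values equally likely
n = 3                             # cases per sample


def sum_squared_deviations(sample):
    """Sum of squared deviations of the measures from their own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - mean) ** 2 for x in sample)


pop_mean = Fraction(sum(population), len(population))
pop_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)

# Every ordered sample of n measures, each sample equally likely.
samples = list(product(population, repeat=n))

mean_of_sample_variances = sum(sum_squared_deviations(s) / n for s in samples) / len(samples)
mean_of_estimates = sum(sum_squared_deviations(s) / (n - 1) for s in samples) / len(samples)

print(pop_variance, mean_of_sample_variances, mean_of_estimates)
```

The mean of the sample variances comes out as (n − 1)/n times the
population variance, while the mean of the estimates of formula (7)
equals the population variance itself.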


2. THE SAMPLING DISTRIBUTION OF "t"

We have observed (p. 19) that the ratio (M_o − M_H)/est'd σ_M is not
normally distributed for small samples, and that therefore we may not
use the normal probability integral table to evaluate the
significance of this ratio when the sample is small. This is true
even though we get an improved estimate of σ_M by substituting
√(Σd²/(n − 1)) rather than √(Σd²/n) for the σ in the formula for the
standard error, as follows:

$$\text{est'd } \sigma_M = \frac{\sqrt{\dfrac{\Sigma d^2}{n-1}}}{\sqrt{n}} = \sqrt{\frac{\Sigma d^2}{n(n-1)}}. \tag{8}$$


If, then, we are to use the ratio (M_o − M_H)/est'd σ_M in testing
the hypothesis that the true mean has a value M_H, we must know the
exact sampling distribution of this ratio. This ratio, using the
improved estimate of σ_M, is known as the "t" statistic. That is,

$$t = \frac{M_o - M_H}{\sqrt{\dfrac{\Sigma d^2}{n(n-1)}}}. \tag{9}$$
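
Formula (9) may be sketched in code as follows. The function name is
hypothetical; the ten weights and the hypothetical mean of 56 pounds
are those of the illustration given in Section 3 of this chapter.

```python
import math


def t_statistic(measures, hypothetical_mean):
    """Return (t, d.f.) for the hypothesis that the true mean has the given value."""
    n = len(measures)
    if n < 2:
        raise ValueError("at least two measures are needed")
    observed_mean = sum(measures) / n
    sum_sq_dev = sum((x - observed_mean) ** 2 for x in measures)
    est_std_error = math.sqrt(sum_sq_dev / (n * (n - 1)))   # formula (8)
    return (observed_mean - hypothetical_mean) / est_std_error, n - 1


weights = [60, 68, 54, 59, 67, 62, 52, 59, 56, 63]
t, df = t_statistic(weights, 56)
print(round(t, 2), df)
```

The number of degrees of freedom returned is one less than the number
of cases.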


An English statistician, writing under the pen name of "Student," has
shown that for random samples drawn from normal populations, the
sampling distribution of t is given by

$$y = \frac{y_0}{\left(1 + \dfrac{t^2}{n-1}\right)^{n/2}},$$

in which n is the number of cases in the sample. We may note that the
denominator of this expression has its minimum value when t = 0, and
that therefore the curve has its maximum ordinate at this point. We
note also that the ordinate y is the same for a given value of t
whether t is positive or negative and that the distribution is
therefore symmetrical. Finally, as t increases the ordinate y
decreases, and the curve is asymptotic to the baseline. The curve is
much like the normal curve except that it is more peaked for small
values of n, as we have noted previously. For large values of n, the
distribution approaches the form of the normal distribution.

A different distribution of £, of course, will be found for each 
value of 7, ог for each number of degrees of freedom. The number 
of degrees of freedom for any value of / is one less than the number 
of cases involved, that is, 4]. = п — 1. Fisher has prepared а 
table * showing, for each value of d.f. from т to зо, what absolute 
value of ¢ would be exceeded in т per cent of all samples, as well as 
in 2 per cent, 5 per cent, ..... 90 per cent of all such samples. 
This table is reproduced on page 53. 

The table for t may be read in the same manner as the table for
χ². (The statistic t is a measure of the divergence of fact from
hypothesis, just as is χ², although a different type of hypothesis is
involved.) For example, for 3 d.f. (that is, for samples of 4 cases
each), t exceeds the value .137 in 90 per cent of all random samples
of this size if the hypothesis is true. For the same size sample, t
exceeds .277 in 80 per cent of such samples, or exceeds 4.541 in
2 per cent, etc. It should be remembered that the table tells what
absolute value of t (either positive or negative) is exceeded in a
given per cent of random samples. Thus, while for a true hypothesis
t has an absolute value greater than 1.729 in 10 per cent of all
samples of 20 (d.f. = 19), its value will exceed +1.729 just 5
per cent of the time, or lie below −1.729 just 5 per cent of the time
for samples of this size.

It will be noted, in the last column of Table 3, that the value of
t demanded for a given level of significance (the 1 per cent level)
becomes larger as the sample becomes smaller. In samples of
31 cases (d.f. = 30), for example, t must exceed 2.750 to be
significant at the 1 per cent level, but for a sample of 6 cases t must
exceed 4.032 to be equally significant. This is consistent with the
fact that as the sample becomes smaller, the estimated $\sigma_M$ in the
denominator of t, and hence t itself, will become more variable.
This does not mean that the sampling distribution of t is inexact
for small samples. On the contrary, the sampling distribution of
t is described exactly in terms of t itself, and does not require any
estimate of a population parameter. While the denominator in t
may be considered as an estimate of the standard error of the
mean, t itself must be considered as a statistic computed entirely
from the data given by the sample.

* Table 3 is taken from Statistical Methods for Research Workers, by
R. A. Fisher, published at 15/- by Oliver and Boyd, Edinburgh.
Attention is drawn to the larger collection in Statistical Tables, by
R. A. Fisher and F. Yates, published at 12/6 by Oliver and Boyd,
Edinburgh.


3. THE SIGNIFICANCE OF THE MEAN OF A SMALL SAMPLE 
We may now consider a concrete illustration of the use of the
t-test. Suppose we have selected a random sample of 10 girls from
a certain population of elementary school girls, and have found
their weights to be 60, 68, 54, 59, 67, 62, 52, 59, 56, and 63 pounds.
The mean of these weights is 60. Now suppose we have some exact
hypothesis concerning the value of the true mean, for example, that
the true mean is 56 pounds. With reference to this hypothetical
value, $M_H$, the value of t is

$$t = \frac{M_o - M_H}{\sqrt{\dfrac{\sum d^2}{n(n-1)}}} = \frac{60 - 56}{\sqrt{\dfrac{244}{10(10-1)}}} = \frac{4}{1.65} = 2.4$$

Turning now to Table 3 we see that, if our hypothesis is true, an
absolute value of t greater than 2.4 would be found in between
5 per cent and 2 per cent of random samples of this size (d.f. = 9).
Hence we must conclude, either that our hypothesis is true and
that something has happened in our one sample that happens less
than once in 20 times, or that our hypothesis is false. Whether or
not we reject the hypothesis, then, depends upon the level of
significance that we have arbitrarily decided to require. This t is
significant at the 5 per cent level, but not at the 2 per cent level.

It may be well to emphasize certain features of the logic involved
in this t-test. In the preceding example, we found that if the true
mean were 56, the chances would be less than 5 in 100 of getting
a t as large as 2.4 (in absolute value) in a random sample of this
size. If our assumptions of random sampling and a normal
population are satisfied, this is an exact statement, involving no
estimates or approximations whatever. This, however, does not
mean that if the true mean were 56 the chances would be less than
5 in 100 of getting an obtained mean as high as 60 or as low as 52.
Depending upon the population σ, these chances might be either
greater or less than 5 in 100. Nevertheless, if t is significant at the
1 per cent level, we can be highly confident that our hypothesis
concerning the true mean is false.
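
The arithmetic of this illustration may be checked by machine. The following is only an illustrative sketch in Python; the function name `one_sample_t` and its argument names are labels of our own choosing and form no part of the procedure itself.

```python
import math

def one_sample_t(values, hypothetical_mean):
    """Return (obtained mean, estimated standard error of the mean, t)."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared deviations of the measures from the sample mean.
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    # Estimated standard error of the mean: sqrt(sum d^2 / (n(n - 1))).
    std_error = math.sqrt(sum_sq_dev / (n * (n - 1)))
    t = (mean - hypothetical_mean) / std_error
    return mean, std_error, t

# Weights of the 10 girls, tested against a hypothetical mean of 56 pounds.
weights = [60, 68, 54, 59, 67, 62, 52, 59, 56, 63]
mean, std_error, t = one_sample_t(weights, 56)
print(f"mean = {mean:.1f}, est'd standard error = {std_error:.2f}, t = {t:.2f}")
```

Carried to more places, the standard error is 1.6465 and t is about 2.43; the value 2.4 given above results from rounding the standard error to 1.65 before dividing.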

The t-test may of course be employed to test any other exact
hypothesis as to the value of the true mean, and it will often be
desirable to establish the limiting values outside of which any such
hypothesis may be rejected with a given degree of confidence.
Suppose, for example, that we wished to determine the highest and
lowest hypothetical values of the true mean of the population of
elementary school girls which would be admissible at the 1 per cent
level. To do this, we would find the value of t, in the last column
of Table 3, for 9 d.f. We would then substitute this value and our
estimate of the standard error of the mean in the formula for t, and
solve for $M_o - M_H$.

The computation in this case is as follows:

$$3.25 = \frac{M_o - M_H}{\text{est'd } \sigma_M} = \frac{M_o - M_H}{1.65}$$

$$M_o - M_H = 1.65 \times 3.25 = 5.36$$

Hence the limiting values of the true mean are 60 ± 5.36, or 54.64
and 65.36. We may then, at the 1 per cent level of confidence,
reject any hypothesis that the true mean lies beyond these limits,
or, otherwise stated, we may be "practically certain" that the true
mean is within these limits.
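
The same limits may be obtained with a few lines of Python. This is a sketch only: the function `limiting_values` is a hypothetical helper of our own, and the critical value 3.25 is simply the entry quoted above from Table 3 for 9 d.f., entered by hand rather than computed.

```python
def limiting_values(obtained_mean, std_error, critical_t):
    """Return the lowest and highest admissible hypothetical means."""
    half_width = critical_t * std_error
    return obtained_mean - half_width, obtained_mean + half_width

# Obtained mean 60, estimated standard error 1.65, and the 1 per cent
# value of t for 9 d.f. as read from the table (3.25).
lower, upper = limiting_values(60, 1.65, 3.25)
print(f"limits at the 1 per cent level: {lower:.2f} to {upper:.2f}")
```

With the rounded standard error of 1.65 this reproduces the limits 54.64 and 65.36 given in the text.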
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4. THE SIGNIFICANCE OF A DIFFERENCE IN THE MEANS OF 
INDEPENDENT SMALL SAMPLES 


The t-test of the significance of a difference between the means of
two small samples is of much the same character as that just
described. The hypothesis we really wish to test is that the two
samples were drawn at random from populations whose means are
equal; that is, we may not be interested in other characteristics of
the population, such as the standard deviation. We cannot use
the t-test to test an hypothesis concerned only with the difference
in the population means, since if the samples are drawn from
populations whose means are equal but whose variances differ, the t's
computed from a series of pairs of samples like these would not be
distributed exactly as is indicated by the table. However, we
can use it to test the hypothesis that both samples were drawn at
random from identical normal populations, i.e., from normal
populations with the same mean and same standard deviation. This,
of course, is equivalent to saying that they were drawn from the
same population, since populations with identical distributions
may be considered as constituting a single population.

If this hypothesis be true, then, according to the reasoning of the
preceding sections, the best estimate of $\sigma_{pop}$ that we can make
from the first sample alone is

$$\text{est'd } \sigma_{pop} = \sqrt{\frac{\sum d_1^2}{n_1 - 1}}$$

in which $d_1$ represents a deviation from the mean ($M_1$) of the first
sample, and $n_1$ is the number of cases. A similar estimate may of
course be made from the second sample. However, a still better
estimate ² of $\sigma_{pop}$ may be secured by considering both samples
together. This best estimate is

$$\text{est'd } \sigma_{pop} = \sqrt{\frac{\sum d_1^2 + \sum d_2^2}{n_1 + n_2 - 2}}$$

¹ It has been suggested that the t-test may also be considered as a test of the
hypothesis that the true means are equal, on the further hypothesis that the true
variances are equal. We may, if we wish, use the test later described in Section 6,
page 60, to determine whether or not the latter hypothesis (of equal variances) is
tenable.

² The proof of this, which is similar to that for (7), is left as an exercise for the
student.

Using this estimate of $\sigma_{pop}$, our best estimate of the standard error
of $M_1$ is

$$\text{est'd } \sigma_{M_1} = \frac{\text{est'd } \sigma_{pop}}{\sqrt{n_1}} = \sqrt{\frac{\sum d_1^2 + \sum d_2^2}{n_1 + n_2 - 2} \cdot \frac{1}{n_1}}$$

The estimated standard error of $M_2$ would be similarly obtained.
Now, since the standard error of a difference between two
independent measures is the square root of the sum of the squares of
their respective standard errors ($\sigma_{diff} = \sqrt{\sigma_1^2 + \sigma_2^2}$), it follows that
our best estimate of the standard error of the difference between
$M_1$ and $M_2$ is

$$\text{est'd } \sigma_{M_1 - M_2} = \sqrt{\left(\frac{\sum d_1^2 + \sum d_2^2}{n_1 + n_2 - 2}\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{10}$$

Now t may be defined as

$$t = \frac{M_1 - M_2}{\text{est'd } \sigma_{M_1 - M_2}}$$

or

$$t = \frac{M_1 - M_2}{\sqrt{\left(\dfrac{\sum d_1^2 + \sum d_2^2}{n_1 + n_2 - 2}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{11}$$

for which the number of degrees of freedom is d.f. = $n_1 + n_2 - 2$.

To illustrate the t-test as applied to a difference, suppose we have
two random samples, each from a different population, and that the
individual measures for these samples are as follows:

Sample 1: 43, 37, 50, 23, 32, 31 ($n_1 = 6$)

Sample 2: 30, 24, 15, 42, 28, 19, 35, 7 ($n_2 = 8$)

The mean of the first sample is 36, and of the second is 25. The
value of t is accordingly

$$t = \frac{36 - 25}{\sqrt{\left(\dfrac{456 + 884}{6 + 8 - 2}\right)\left(\dfrac{1}{6} + \dfrac{1}{8}\right)}} = \frac{11}{5.71} = 1.93$$

Now from Table 3 we find (in the row for d.f. = 6 + 8 − 2 = 12)
that in a large number of pairs of samples like this a value of t as
large as 1.93 would be found between 10 per cent and 5 per cent
of the time. This value of t, then, fails to be significant at the
5 per cent level. If we were using conservative standards,
therefore, we would not feel justified in rejecting the null hypothesis
in this case.
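
Formula (11) lends itself readily to machine computation. The sketch below, in Python, is offered only as an illustration; the function name `pooled_t` and its return values are our own conventions, not part of the formula.

```python
import math

def pooled_t(sample_1, sample_2):
    """Return (t, degrees of freedom, estimated standard error of the difference)."""
    n1, n2 = len(sample_1), len(sample_2)
    m1 = sum(sample_1) / n1
    m2 = sum(sample_2) / n2
    # Sums of squared deviations, each taken from its own sample mean.
    ss1 = sum((x - m1) ** 2 for x in sample_1)
    ss2 = sum((x - m2) ** 2 for x in sample_2)
    df = n1 + n2 - 2
    # Formula (10): pooled estimate of the standard error of the difference.
    se_diff = math.sqrt((ss1 + ss2) / df * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, df, se_diff

sample_1 = [43, 37, 50, 23, 32, 31]
sample_2 = [30, 24, 15, 42, 28, 19, 35, 7]
t, df, se_diff = pooled_t(sample_1, sample_2)
print(f"t = {t:.2f} with {df} d.f. (est'd standard error {se_diff:.2f})")
```

For the two samples above, the sums of squared deviations are 456 and 884, and the result agrees with the value of 1.93 for 12 degrees of freedom.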

It is important to remember the exact nature of the hypothesis
that we have tested. When the value of t exceeds that required
for a given level of significance we may, with a corresponding
degree of confidence, reject the hypothesis that the samples were
drawn at random from the same or identical populations. While
this is equivalent to saying that the samples came from different
populations, it is not equivalent to saying that the means of these
populations necessarily differ. It is possible, though improbable,
that the samples came from populations whose means are the same
but whose standard deviations differ. In most applications this
possibility need not concern us greatly, and we may generally be
quite confident that the means do differ if t is highly significant.
However, in case there is any doubt, it might be well to make a
separate test of the hypothesis that the true variance is the same
for both samples. A method of testing this hypothesis will be
presented in Section 6 following.


5. THE SIGNIFICANCE OF A DIFFERENCE IN THE MEANS OF
RELATED MEASURES

Quite frequently, when selecting two samples for the purpose of
evaluating the effect of a variation in a given factor, we may select
the cases in pairs, such that the members of a pair are more likely
to be similar than two cases independently selected. One of our
samples is then made up of the first members of the pairs; the
second sample is made up of the second members. For example, in a
methods experiment, we may select pairs of pupils who have made
equal scores on an intelligence test, and hence may expect them to
be more nearly alike in achievement than randomly paired pupils.
Again, we might sometimes have two measures for each pupil,
one secured before and the other after a given “treatment,” and
may wish to know if the “treatment” has affected the mean
status of the pupils.

In such cases, if the paired measures are correlated, we may find
the difference for each pair and then, for this distribution of
differences, we may determine whether or not the mean of the
distribution (the mean difference) differs significantly from zero.
Suppose, for example, in a methods experiment in which the pupils had
been originally “matched” for intelligence, the final measures of
achievement were as follows:

                  ACHIEVEMENT SCORES
Pair      Sample 1      Sample 2      Differences
  1          20            16              4
  2          34            35             -1
  3          24            22              2
  4          37            29              8
  5          23            24             -1
  6          35            30              5
  7          30            27              3
  8          29            25              4

Mean       29.00         26.00           3.00


The mean of the differences is 3.00. Hence, for $M_H = 0$, the
value of t for this set of differences is

$$t = \frac{M_o - M_H}{\sqrt{\dfrac{\sum d^2}{n(n-1)}}} = \frac{3.00 - 0}{\sqrt{\dfrac{64}{8(8-1)}}} = \frac{3.00}{1.07} = 2.80$$

in which each d is the deviation of a difference from the mean
difference. If our hypothesis were true, an absolute value of t this
large would be found less than 5 per cent of the time (d.f. = 7).
Hence, we may be reasonably confident that the observed difference
is not due entirely to chance.

It is worth noting that had we considered the measures in this
example as independent measures, and had applied the t-test
described in the preceding section, we would have found a t of 1.01
(with 14 degrees of freedom), which would have made the
difference seem far from significant.
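
The contrast between the two treatments of these data can be verified directly. The following Python sketch is illustrative only; `paired_t` and `independent_t` are names of our own, the second being merely a restatement of Formula (11).

```python
import math

def paired_t(first, second):
    """t for the mean of the pair differences against a hypothesis of zero."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sum_sq = sum((d - mean_diff) ** 2 for d in diffs)
    return mean_diff / math.sqrt(sum_sq / (n * (n - 1))), n - 1

def independent_t(first, second):
    """t by Formula (11), treating the two sets of measures as independent."""
    n1, n2 = len(first), len(second)
    m1, m2 = sum(first) / n1, sum(second) / n2
    ss = sum((x - m1) ** 2 for x in first) + sum((x - m2) ** 2 for x in second)
    df = n1 + n2 - 2
    return (m1 - m2) / math.sqrt(ss / df * (1 / n1 + 1 / n2)), df

sample_1 = [20, 34, 24, 37, 23, 35, 30, 29]
sample_2 = [16, 35, 22, 29, 24, 30, 27, 25]
t_paired, df_paired = paired_t(sample_1, sample_2)
t_indep, df_indep = independent_t(sample_1, sample_2)
print(f"paired:      t = {t_paired:.3f}, d.f. = {df_paired}")
print(f"independent: t = {t_indep:.3f}, d.f. = {df_indep}")
```

Without intermediate rounding the paired t is about 2.806 (the text's 2.80 follows from rounding the standard error to 1.07), while the independent treatment gives about 1.008 on 14 degrees of freedom.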
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6. THE SIGNIFICANCE OF A DIFFERENCE IN VARIABILITY FOR SMALL 
SAMPLES: THE “F” TEST

The significance of a difference in the standard deviations of two
large samples is usually tested by comparing the observed
difference with its standard error, computed as

$$\sigma_{\sigma_1 - \sigma_2} = \sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2}$$

in which the standard errors of the standard deviations are
estimated as

$$\sigma_{\sigma_1} = \frac{\sigma_1}{\sqrt{2n_1}} \quad \text{and} \quad \sigma_{\sigma_2} = \frac{\sigma_2}{\sqrt{2n_2}}$$

This procedure breaks down for small samples, primarily
because the standard deviations of small samples are not normally
distributed. However, the significance of a difference in the
standard deviations of two small samples may be tested in a
manner similar to that employed with means. The hypothesis we wish
to test is that the samples were drawn from equally variable
populations. Rather than deal directly with the difference between the
observed σ's, we will deal with the ratio between the corresponding
estimates of the true variances. This “variance ratio” may be
defined as

$$F = \frac{\sigma_1^2}{\sigma_2^2}$$

in which $\sigma_1^2 = \dfrac{\sum d_1^2}{n_1 - 1}$ and $\sigma_2^2 = \dfrac{\sum d_2^2}{n_2 - 1}$, or in which $\sigma_1$ and $\sigma_2$ are
the estimated σ's of the populations sampled.

The ratio F is always taken so that the larger variance is in the
numerator. The number of degrees of freedom for each variance
is one less than the n on which it is based.

The test of significance which is based on this ratio is due to R. A.
Fisher, who showed how the function z, defined as

$$z = \tfrac{1}{2} \log_e \frac{\sigma_1^2}{\sigma_2^2}$$

is distributed for pairs of random samples of various size
combinations all drawn from the same population. Fisher prepared
tables ¹ showing, for various combinations of degrees of freedom,
how large a value of z is exceeded in 5 per cent, 1 per cent, and
0.1 per cent of an infinite number of pairs of samples of the given
size combination. These tables, however, are relatively
inconvenient to use; to use them, we must find the logarithm of the
variance ratio. To avoid this inconvenience, G. W. Snedecor
computed the value of F corresponding to the value of z for each
combination of degrees of freedom within a useful range, and prepared
tables ² for F similar to Fisher's tables for z.

Snedecor's table for F is reproduced on the following pages
(Table 4). The columns in this table correspond to the number of
degrees of freedom for the larger variance, the rows to that for the
smaller variance. Within each cell of the table two values of F are
given. The upper number in each pair is the value of F that would
be exceeded in 5 per cent of an infinite number of pairs of random
samples (with the given numbers of degrees of freedom) all drawn
from the same population. The lower number in each pair is the
value of F that would be exceeded 1 per cent of the time. The
5 per cent and 1 per cent points for combinations of degrees of
freedom not given in the table may be secured by interpolation
between those for the nearest combinations given.

To illustrate the use of this table, suppose that for a sample of
9 cases $\sigma_1^2 = \dfrac{\sum d_1^2}{n_1 - 1} = 42.1$ and for a sample of 5 cases $\sigma_2^2 = \dfrac{\sum d_2^2}{n_2 - 1} = 6.3$.
The value of F would then be $\dfrac{42.1}{6.3} = 6.68$.


In Table 4 we find, for $\text{d.f.}_1 = 8$ and $\text{d.f.}_2 = 4$, that an F of 6.04
would be exceeded in 5 per cent of all pairs of samples of this size
combination if the samples were drawn from equally variable
populations, and that a value of 14.80 would be exceeded 1 per cent
of the time. If our hypothesis were true, we would secure an F as
large as 6.68 between 1 per cent and 5 per cent of the time just as a
result of chance fluctuations from sample to sample. Whether or
not we would describe the difference in variability of these samples
as “significant” would then depend upon the “level of significance”
which we chose arbitrarily to employ.

¹ R. A. Fisher, Statistical Methods for Research Workers, Sixth Edition, Table VI,
pages 248-253, or Statistical Tables, by Fisher and Yates, pages 28-35. (The latter
reference also contains tables for the variance ratio (F), and gives the 20 per cent
and 0.1 per cent points, as well as the 5 per cent and 1 per cent.)

² G. W. Snedecor, Statistical Methods, Table 10.2, pp. 174-177.
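
The rule that the larger variance is placed in the numerator, and the comparison with the two tabled points, may be sketched as follows. The helper `variance_ratio` is a hypothetical name of our own, and the 5 per cent and 1 per cent points are simply the two entries quoted above from Table 4 for 8 and 4 degrees of freedom, entered by hand.

```python
def variance_ratio(variance_a, df_a, variance_b, df_b):
    """Return (F, d.f. of larger variance, d.f. of smaller variance)."""
    # The larger estimated variance always goes in the numerator.
    if variance_a >= variance_b:
        return variance_a / variance_b, df_a, df_b
    return variance_b / variance_a, df_b, df_a

# Estimated variances 42.1 (9 cases) and 6.3 (5 cases).
f_ratio, df_larger, df_smaller = variance_ratio(6.3, 4, 42.1, 8)

# Points read from the table for 8 and 4 d.f., as quoted in the text.
five_per_cent_point = 6.04
one_per_cent_point = 14.80

if f_ratio > one_per_cent_point:
    verdict = "beyond the 1 per cent point"
elif f_ratio > five_per_cent_point:
    verdict = "between the 5 and 1 per cent points"
else:
    verdict = "short of the 5 per cent point"

print(f"F = {f_ratio:.2f} for {df_larger} and {df_smaller} d.f.: {verdict}")
```

The order in which the two variances are supplied does not matter, since the function itself arranges them.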


7. THE SIGNIFICANCE OF THE MEAN OF A SAMPLE CONSISTING OF 
RANDOMLY SELECTED INTACT GROUPS 

We have noted in Chapter I that small sample error theory is of
particular importance in educational research because so many of
our samples consist of a small number of intact and relatively
homogeneous subgroups. Usually each of these groups consists of the
pupils in a single class, or under a single teacher, or in a single school
or community. If these groups have been selected at random from
all such groups in the population, it is possible to determine the
significance of the obtained mean of the total sample by the procedure
presented in the following illustration.

This illustration is based upon actual data obtained from the
1938 Iowa Every-Pupil Testing Program. Table 5 presents the
distributions of scores on a certain achievement test in English
Correctness for the ninth-grade pupils in 11 Iowa high schools. These
schools were selected at random from all schools of 65 to 125
enrollment that participated in the 1938 program. We note that a
total of 414 pupils was tested in these 11 schools, and that the mean
and standard deviation of the combined distribution are 164.3 and
29.3 respectively. If we were to consider this as a random sample
of 414 pupils, we would estimate the standard error of the mean as
$29.3/\sqrt{414} = 1.44$, and hence we would say, at the 1 per cent level of
confidence, that the true mean of the population lies between
160.48 and 168.12 (i.e., not more than 2.576 standard errors from
the obtained mean).

This sample, however, is not a random sample of pupils, even
though it is a random sample of schools. It is almost self-evident
that a sample of, say, 37 pupils all selected from school No. 1 would
not be as good a basis for generalization as a sample of 37 cases
selected at random from all pupils in the entire population. In the
random sample many different schools would be represented, and
the mean would not be unduly influenced by the systematic
superiority or inferiority of any one school. It should be equally evident
that a sample of 414 pupils from 11 schools is not as good as a
random sample of 414 cases from all schools in the whole population.

TABLE 5
DISTRIBUTIONS OF SCORES ON THE 1938 IOWA EVERY-PUPIL TEST IN
ENGLISH CORRECTNESS FOR THE NINTH-GRADE PUPILS IN 11 IOWA
HIGH SCHOOLS

[Frequency table by school, in ten-point score intervals running up
to 230-239, with the sum of scores for each school.
Mean of School Means = 163.8; Mean of σ's = 26.0]

This sample, then, should be considered as consisting of 11
schools rather than of 414 pupils. The mean of the sample should
be considered as a weighted mean of 11 school means, rather than
of 414 pupil scores. The significance of this mean should be
determined from the distribution of the 11 school means by means of
the t-test as for a sample of 11 cases.

If the number of cases is the same or very nearly the same for all
schools, it will be convenient to use as the mean of the sample the
unweighted mean of the school means. The t-test may then be
applied as in the following illustration, based on the data of Table 5.

School      Deviations from      Deviations
Means       Unweighted Mean      Squared
173.4             9.6               92.16
189.6            25.8              665.64
149.5           -14.3              204.49
153.3           -10.5              110.25
176.6            12.8              163.84
148.2           -15.6              243.36
161.9            -1.9                3.61
155.3            -8.5               72.25
174.2            10.4              108.16
157.1            -6.7               44.89
162.7            -1.1                1.21

Total 1801.8       Sum of squared d's = 1709.86

Unweighted
Mean 163.8

$$\text{est'd } \sigma_M = \sqrt{\frac{1709.86}{11(11 - 1)}} = 3.94$$
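
The difference between the two ways of regarding this sample is easily seen when both standard errors are computed side by side. The Python sketch below is illustrative only, and the variable names are our own; it uses the eleven school means listed above and the combined standard deviation of 29.3 for 414 pupils.

```python
import math

school_means = [173.4, 189.6, 149.5, 153.3, 176.6, 148.2,
                161.9, 155.3, 174.2, 157.1, 162.7]

# Treating the 11 school means as the cases of a small sample.
n_schools = len(school_means)
unweighted_mean = sum(school_means) / n_schools
sum_sq = sum((m - unweighted_mean) ** 2 for m in school_means)
se_schools = math.sqrt(sum_sq / (n_schools * (n_schools - 1)))

# Treating the 414 pupils (wrongly) as a simple random sample.
se_pupils = 29.3 / math.sqrt(414)

print(f"unweighted mean of school means = {unweighted_mean:.1f}")
print(f"sum of squared deviations       = {sum_sq:.2f}")
print(f"standard error, 11 schools      = {se_schools:.2f}")
print(f"standard error, 414 'random' pupils = {se_pupils:.2f}")
```

The standard error based on schools is well over twice that obtained on the assumption of random sampling of pupils.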


If, as usually happens, the number of cases varies considerably
from school to school, it will be necessary to deal with the weighted
mean, and also to weight each squared deviation (of school mean
from this weighted mean) by the ratio between the number of cases
in the corresponding school and the average number of cases in all
schools. In this instance, for example, the average number of
pupils per school is 414/11 = 37.65. Hence, the number of cases in
School No. 1 is 37/37.65 = .9827 times as large as in the average
school. For School No. 2 this ratio is 39/37.65 = 1.0358, for School
No. 5 it is 44/37.65 = 1.6866, etc. The squared deviation of each
school mean is then multiplied by the corresponding ratio, and
these products are added to secure the sum of the weighted squared
deviations. From this sum, the standard error of the mean is
computed as before. These steps are illustrated below for the
data of Table 5.

School    Deviations from    Deviations
Means     Weighted Mean      Squared      Ratio     Weighted d²'s
173.4           9.1            82.81       .9827        81.38
189.6          25.3           640.09      1.0358       663.01
149.5         -14.8           219.04       .9562       209.45
153.3         -11.0           121.00       .9031       109.28
176.6          12.3           151.29      1.6866       255.17
148.2         -16.1           259.21       .9296       240.96
161.9          -2.4             5.76      1.1421         6.58
155.3          -9.0            81.00       .9562        77.45
174.2           9.9            98.01      1.0093        98.92
157.1          -7.2            51.84      1.0093        52.32
162.7          -1.6             2.56       .9031         2.31

                    Sum of weighted squared d's = 1796.83
Weighted
Mean 164.3

$$\text{est'd } \sigma_M = \sqrt{\frac{1796.83}{11(11 - 1)}} = 4.03$$

For $M_H = 168.12$, $t = \dfrac{168.12 - 164.3}{4.03} = .95$

For $M_H = 177$, $t = \dfrac{177 - 164.3}{4.03} = 3.15$

The method of computation that was used in the preceding
example is not the easiest to employ, and was presented only to show
the essential nature of the procedure. A more convenient method
of computation is to multiply each school mean by the total of the
scores on which it is based, to add these products, and to subtract
from this sum the product of the mean of the total distribution and
the sum of all scores. This result should then be divided by the
product of the total number of pupils and the number of degrees
of freedom (one less than the number of schools) to yield the square
of the standard error of the mean.* The square root of this result
will then be the desired standard error.

* This procedure may be expressed in terms of a formula, as

$$\text{est'd } \sigma_M^2 = \frac{\sum M_s T_s - GM \cdot GT}{N(n - 1)}$$

in which $M_s$ represents the mean for any school, $T_s$ the corresponding sum of pupil
scores, $\sum M_s T_s$ the sum of the MT products for all schools, GM the general mean for
all schools, GT the grand total of pupil scores, N the total number of pupils, and n
the number of schools. The theoretical basis for this method of computation will be
explained later in the discussion of analysis of variance, pages 87 to 92.

In this example, using the data from Table 5, we find the sum of
all products of means and totals, as follows:

(6416.5)(173.4) + (7395.5)(189.6) + (5382.0)(149.5) + ... etc.
= 11,246,655.75

We then find the product of the total and mean for the combined
distribution and subtract this product from the result just
obtained, as follows:

11,246,655.75 − (68043.0)(164.3) = 67190.85


If this is done on a Monroe calculator, or a similar computing 
machine, this result can be obtained without any transcription. 
The procedure is to secure the first product, then leave it in the 
lower dial while the second product is secured, etc., until all prod- 
ucts of totals and means for individual schools have been cumu- 
lated. The product of total and mean for the combined distribu- 
tion may then be subtracted from the result already in the machine 
by “multiplying backward.” 

We now divide this result by 414 × 10 = 4140, to secure

$$\text{est'd } \sigma_M^2 = \frac{67190.85}{4140} = 16.2296$$

The square root of this result is the estimated standard error of
the mean, in this case

$$\text{est'd } \sigma_M = \sqrt{16.2296} = 4.03,$$

which is the same result obtained before by the more laborious
computational procedure.
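
The shorter method reduces to a single expression, which may be sketched in Python as follows. The function name `standard_error_from_totals` and its argument names are hypothetical labels of our own; the figures supplied are the aggregates quoted in the text, since the school-by-school totals are not all reproduced here.

```python
import math

def standard_error_from_totals(sum_of_mean_total_products, general_mean,
                               grand_total, total_pupils, n_schools):
    """Estimated standard error of the weighted mean of school means."""
    # Sum of (school mean x school total) minus general mean x grand total.
    numerator = sum_of_mean_total_products - general_mean * grand_total
    # Divide by total pupils times degrees of freedom (schools less one).
    squared_se = numerator / (total_pupils * (n_schools - 1))
    return math.sqrt(squared_se)

se = standard_error_from_totals(
    sum_of_mean_total_products=11_246_655.75,
    general_mean=164.3,
    grand_total=68_043.0,
    total_pupils=414,
    n_schools=11,
)
print(f"est'd standard error of the mean = {se:.2f}")
```

Because the numerator is a small difference between two very large products, the result is sensitive to the number of decimal places carried in the means.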

We note that in the case of this illustrative problem the results
secured from the unweighted data were approximately the same as
those secured from the weighted data. This is because the
number of cases was nearly constant for the schools used in the
illustration. Ordinarily, of course, the latter computational procedure
should be employed.


We may now see how invalid were the results secured (page 66)
when the whole sample was considered as a random sample of 414
pupils. This false assumption led to an estimate of 1.44 as the
standard error of the mean, whereas the more valid estimate is
4.03. On the basis of the assumption of random sampling, we
reasoned that the true mean could hardly exceed 168.12. When
we use the more valid estimate of the standard error and apply the
t-test, we find that for an $M_H$ even as large as 177 the value of t is
only 3.15. For 10 degrees of freedom, this large a value of t just
fails to be significant at the 1 per cent level. The mean of this
sample, then, is not nearly as reliable as the mean of a sample of
414 pupils selected at random, and the application of the usual
large sample techniques which assume random sampling would be
seriously misleading.

It should be noted that an underlying assumption in the t-test is
that the measures from which t is computed are a random sample
from a normal population. In the case just considered, this
implies that the 11 school means must be considered as a random
sample of the means of all such schools in the state, and assumes
that these means will be normally distributed. This latter
assumption may be fairly well satisfied if the individual schools are
of approximately the same size, but it is less likely to be if they
differ widely in size. There is reason to believe, however, that the
t-test is reasonably valid even though the form of distribution for
the population sampled differs considerably from that of the
normal curve.

This method also assumes that the variability within schools is 
fundamentally constant from school to school; that is, it assumes 
that the differences in variability from school to school are no 
larger than would be found in random samples of the same size. 
This will be given further consideration in the discussion of 
analysis of variance. 

It should be noted also that if the number of schools is very
small, there will be a considerable loss due to the small number of
degrees of freedom upon which the t is based. (This will not be
serious if the number of schools is 10 or larger, since, as we may
note in the table for t, the critical value of t for any given level of
significance does not decrease markedly for further increases in
d.f.)

It may be worth observing that the foregoing considerations are
of particular significance in the evaluation of a norm established
for a standardized test.¹ Many current norms, while based on
large numbers of pupils, include pupils from only a very small
number of schools. A norm of this type, even though based on
several thousand pupils, may be no more reliable than one based
on a truly random sample of only 50 or 100 pupils.


8. THE SIGNIFICANCE OF A DIFFERENCE IN MEANS FOR SAMPLES 
EACH OF WHICH CONSISTS OF RELATIVELY 
HOMOGENEOUS SUBGROUPS 

When each of two samples is of the character of that of Table 5,
and when the subgroups in one sample are independent of those in
the other, the difference in means for the two samples may be
evaluated in a manner suggested in Section 4, pages 56-58. For
the first sample, we would find the sum of weighted squared
deviations of the means of the subgroups from the weighted mean of
that sample. This would be done by (1) adding the products of
the subgroup totals and means, (2) subtracting from this sum the
product of the grand total and the weighted mean of the sample, and
(3) dividing the result by the average number of cases per subgroup.
The sum of weighted squared deviations would similarly be found
for the second sample. The sum of these two sums
is then the same as ($\sum d_1^2 + \sum d_2^2$) in formulas (10) and (11), and the
rest of the procedure would be that suggested on page 57, $n_1$
representing the number of subgroups in the first sample, and $n_2$ the
number of subgroups in the second.

¹ Lindquist, E. F., “Factors Determining the Reliability of Test Norms.” Journal
of Educational Psychology, 21: 512-20 (October, 1928).

To illustrate this procedure, suppose that for two samples like
that of Table 5 the totals and means for the various schools are as
follows:

TOTALS AND MEANS FOR SUBGROUPS IN TWO SAMPLES

             Sample 1                            Sample 2
School     N     Total    Mean       School     N     Total    Mean
  1       30      678     22.6         1       18      488     27.1
  2       26      840     32.3         2       31      682     22.0
  3       52     1503     28.9         3       44      766     17.4
  4       16      358     22.4         4       17      502     29.5
  5        8      171     21.4         5       52     1170     22.5
                                       6       17      432     25.4
Totals   132     3550                  7       25      892     35.7

Weighted Mean = 26.89                Totals   204     4932

                                     Weighted Mean = 24.18

The sum of products of school totals and means, minus the product
of the grand total and weighted general mean, is 2110.6 for
Sample 1. For Sample 2 the corresponding result is 6252.6. The
sum of weighted squared deviations ($\sum d^2$) for Sample 1 is then
2110.6/26.4 = 79.95, and ($\sum d^2$) for Sample 2 is 6252.6/29.14 = 214.55.
We may then substitute in Formula (11) as follows:

$$t = \frac{26.89 - 24.18}{\sqrt{\left(\dfrac{79.95 + 214.55}{5 + 7 - 2}\right)\left(\dfrac{1}{5} + \dfrac{1}{7}\right)}} = \frac{2.71}{\sqrt{10.10}} = .85$$

We note in Table 3, for 10 degrees of freedom, that this is almost
the value of t that would be exceeded 40 per cent of the time by
chance alone. Clearly this difference is not significant.
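
The whole of this computation may be carried through from the numbers of cases and the totals alone. The Python sketch below is an illustration under stated assumptions: the function name `weighted_sum_sq` is our own, the school means are computed without rounding (so that the intermediate sums differ slightly from those obtained above with means rounded to one decimal), and the total of 892 for the seventh school of Sample 2 is inferred from the column total of 4932.

```python
import math

def weighted_sum_sq(sizes, totals):
    """Return (sum of weighted squared deviations, weighted mean)."""
    grand_total = sum(totals)
    total_n = sum(sizes)
    weighted_mean = grand_total / total_n
    # Sum of (subgroup total x subgroup mean) minus grand total x weighted mean.
    numerator = sum(t * t / n for n, t in zip(sizes, totals)) \
        - grand_total * weighted_mean
    # Divide by the average number of cases per subgroup.
    average_n = total_n / len(sizes)
    return numerator / average_n, weighted_mean

sizes_1, totals_1 = [30, 26, 52, 16, 8], [678, 840, 1503, 358, 171]
sizes_2 = [18, 31, 44, 17, 52, 17, 25]
totals_2 = [488, 682, 766, 502, 1170, 432, 892]  # 892 inferred from the total

ss_1, mean_1 = weighted_sum_sq(sizes_1, totals_1)
ss_2, mean_2 = weighted_sum_sq(sizes_2, totals_2)

# Formula (11), with the subgroups as the cases.
n1, n2 = len(sizes_1), len(sizes_2)
df = n1 + n2 - 2
t = (mean_1 - mean_2) / math.sqrt((ss_1 + ss_2) / df * (1 / n1 + 1 / n2))
print(f"weighted means {mean_1:.2f} and {mean_2:.2f}; t = {t:.2f}, d.f. = {df}")
```

The value of t agrees with the .85 found above to two decimal places.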
If the subgroups in the two samples have been paired on some
basis such that there is a significant correlation between the means
of the paired subgroups, a different procedure must be followed.
If the numbers in all subgroups are the same (or very nearly so),
we may deal with the unweighted means, following the procedure
suggested in Section 5, page 58. To illustrate, suppose we have
the scores made on a test in world history in 6 high schools, that
the numbers in each high school are very nearly the same, and that
there is very nearly the same number of boys as of girls in each
school. Suppose that the means are as given below.

MEAN SCORES OF BOYS AND GIRLS IN EACH SCHOOL SEPARATELY

[Table giving, for each of the 6 schools, the mean for boys, the mean
for girls, and the difference; the entries are not recoverable.]


We note at once that there is some correlation between the means
for boys and girls. The school (#6) whose mean for boys is highest
is also that whose mean for girls is highest, and the lowest mean
for boys (#5) is also associated with the lowest for girls.
Accordingly, the estimated standard error of the mean of the differences
(3.23) in school means for the sexes is

$$\text{est'd } \sigma_{M_d} = \sqrt{\frac{\sum d^2}{n(n-1)}} = 1.19$$

and hence t = 3.23/1.19 = 2.71.

For 5 degrees of freedom, an absolute value of t as large as 2.71
would be found less than 5 per cent of the time if the true difference
were zero, or a positive value of t this large would be found less
than 2.5 per cent of the time. We are therefore fairly well justified
in asserting that boys in general are superior to girls in general in
achievement in world history, although the evidence may not be
as strong as we should like it to be.

It will be interesting to compare these results with those that
would be obtained by the method of Section 4, pages 56-58. In
this case, the estimated standard error of the difference is

$$\text{est'd } \sigma_{M_1 - M_2} = \sqrt{\left(\frac{180.56 + 92.35}{6 + 6 - 2}\right)\left(\frac{1}{6} + \frac{1}{6}\right)} = 3.00$$

This estimate is much larger than before (1.19) and t now becomes
3.23/3.00 = 1.08.

For 10 degrees of freedom, this absolute value of t would be
exceeded by chance alone at least 30 times in 100. Hence, this test
would make the difference appear much less significant than it
really is.

The first of these tests, then, takes advantage of the homoge- 
neity of achievement in individual schools. It should be noted, 
however, that this procedure again deals with unweighted means 
and that there is considerable loss due to the small number of 
degrees of freedom involved. A still more adequate test which
may sometimes be employed in situations like this, a test which
takes differences in size of school into consideration, will be sug-
gested later by the methods of analysis of variance (see pages
7 f).


CHAPTER IV 


THE IMPORTANCE OF DESIGN IN EDUCATIONAL 
EXPERIMENTS 


1. THE NEED FOR MEASURES OF PRECISION

The fundamental purpose of most experimental research in
education is to discover the effects upon the pupil of specific varia- 
tions in his environment or training. The typical procedure in 
such experiments is (1) to select two or more groups of pupils, each 
of which is presumably representative of some defined population 
about which generalizations are to be established, (2) to subject 
each of these groups to one of a number of prescribed “treat-
ments,” (3) to secure criterion measures of the final status or
change of status of each pupil with reference to the particular trait 
or traits which the treatments are intended to modify, and (4) to 
analyze and evaluate the results by means of statistical techniques. 

The precision of any such experiment may be defined as the 
degree to which the observed differences in results from group to 
group are due only to the differences which have been deliberately 
introduced into the "treatments." The precision of the experi- 
ment will then depend upon the success with which all factors 
which might otherwise affect the results, other than the deliberate 
variations in “treatment,” have been controlled or equalized (or 
corrections made) from group to group, and upon the extent to 
which the criterion measures really measure the things which they 
are intended to measure. 

Absolute precision is, of course, impossible. The factors which 
may conceivably affect the results are ordinarily too numerous and 
complex even to permit the identification of all of them, to say 
nothing of their equalization or measurement, and the effects which 
it is desired to measure are often only vaguely defined, and may 
usually be measured only indirectly and with high fallibility. The 
results obtained from an experiment may therefore never be taken 
at their face value, but must always be considered as only approxi-
mate in character, or as likely to be in error by some indefinite
amount. In other words, it is always possible that any observed 
difference in results is due, not to the treatment differences, but to 
uncontrolled and unmeasured variations in factors extraneous to 
the purposes of the experiment. 

While it is obviously impossible to determine the magnitude and 
direction of these errors, it is sometimes possible to determine the
probability that the error arising from certain sources will exceed a
given magnitude, and thus to estimate the maximum error that it
is reasonable to suppose might arise from those sources. Allow-
ance may then be made for this maximum error in the evaluation
of results and, if the other sources of error have been adequately
controlled, the experiment may yet lead to sound and useful con-
clusions.

It is extremely significant that unless one has a fairly definite and 
dependable (even though subjectively derived) estimate of the 
maximum error which might be present in an obtained difference, 
one can draw no useful conclusion from that difference, no matter
how precise it may be in reality. So long as the error may be of
any magnitude, it is always conceivable that the difference is due
to error alone. Given a dependable estimate of error, one can
demonstrate that certain hypotheses are inconsistent with the
results obtained; without any such estimate any hypothesis what-
ever is admissible, including of course the hypothesis that there are
no real differences in treatments.

In a very real sense, then, it is more important to know the 
degree of precision of an experiment, whether high or low, than it is
that the precision be in reality high. An observed difference in
results may be of very low precision and yet reveal conclusively
that there is a corresponding difference in treatments, if one can
demonstrate objectively that the maximum error, however large,
could not alone account for all of the observed difference. In
other words, a difference may be statistically significant even
though very unreliable. On the other hand, an investigator may
so design and conduct an experiment that the degree of precision
attained is in reality very high, but unless he knows and can con- 
vince others that this is the case, that is, unless he can set definite 
limits to the errors present, anyone may successfully contend that 
the observed differences, regardless of their magnitude, are in fact 
due to error alone. 

In designing an experiment, then, it is just as important to pro- 
vide for an objective and dependable estimate of error as it is to 
provide for high precision. In fact, no efforts to increase the pre- 
cision of an experiment will be of any avail unless one can also 
dependably describe the increased degree of precision attained. On 
the contrary, if by some device one eliminates a certain source of
error, and yet continues to employ an estimate of error that still
makes allowance for errors from that source, the experiment may
even appear less conclusive than otherwise. This is because the
observed differences may actually become smaller (since there is no 
longer any possibility of their being inflated by this particular 
source of error), but this fact will only make the available esti- 
mates of error appear larger in relation to the reduced differences. 
This is a mistake that has very frequently been made in educa- 
tional research, as will be shown by illustrations later. 

It should therefore be a maxim of experimental design that if a 
given source of error cannot be eliminated both from the experi- 
mental results and from the estimates of error, it had better not be 
eliminated at all. In other words, it may sometimes be desirable 
to select an experimental design that will lead to lower precision 
than another, if the first design will permit a valid estimate of
error and the second will not. 


2. SOURCES OF ERROR IN EDUCATIONAL METHODS EXPERIMENTS 

With reference to this problem of the estimation of error, we may 
distinguish between two major sources of error in experiments of 
the type considered. The first of these is the possibility that the 
experimental groups are so unlike one another in their ability to 
profit by any treatment which may be administered that the ob-
served differences in results are due entirely to differences in the
groups themselves, rather than in the treatments received by
them. The second is the possibility that, in spite of the precau-
tions taken, other factors than those involved in the treatments
may have been permitted to vary from group to group during the
course of the experiment, and that these uncontrolled variations
alone or in part account for the differences observed. The first
type of error arises from individual differences among the pupils
constituting the groups, the second from variations in outside
factors affecting the groups as a whole. The first may be illus-
trated in an experimental comparison of two methods of instruction
by differences in the intelligence or learning ability of the pupils
within the experimental groups, or in the quality of their home
environments, or in their established habits of study. The second
may be illustrated in the same situation, by differences in the abil-
ities of the two teachers selected to teach the experimental groups,
or by differences in the circumstances attending the administration
of the methods, such as the fact that one group was taught in a
poorly ventilated and the other in a well-ventilated classroom.
In a single experiment, objective estimates may be derived only
for errors of the first of these two types. This is because the only
basis we have for the estimation of error is the mathematical theory
of probability. In order to utilize this theory, it is essential that
the “error” variables be distributed strictly at random with refer-
ence to the treatments compared. It is also essential that there be
a number of observations for each treatment, since no statement
of probability can be based on a single observation. In a properly
designed experiment, therefore, all errors related to pupil-variables
may be readily taken into consideration. The pupils may be as-
signed at random to the treatments, and since there is a number
of pupils in each group, it is possible to compute a measure of the
variability of results from pupil to pupil under each treatment,
and this in turn will make possible an estimate of the variation in
means (or other derived measures) that would be expected for
other random groups of the same size and given the same treat-
ment. Errors arising from variations outside the groups, however,
may not be so treated. For example, in a methods experiment in 
which there is only one teacher for each method, no estimate may 
be made of the degree to which the results would vary for other 
teachers using the same method, that is, there would be no possi- 
bility of discriminating between the error in results due to the 
teachers and the real differences due to the methods. In a single 
experiment then, the only recourse for the experimenter is to at- 
tempt to eliminate the second type of error entirely. 

While errors of the second type may, with proper care, be mark-
edly reduced, they obviously cannot be eliminated. This has
heretofore constituted one of the principal limitations of experi-
mental methods in education, since the methods of statistical
analysis usually employed do not take this second type of error
into adequate account in the estimate of error. To make matters
worse, the methods of control over this source of error which have 
been employed have often rendered invalid the estimate of error 
due to pupil variables, and have sometimes reduced the magnitude 
of the observed differences without any reduction in the error 
estimate. 


3. TYPES OF EXPERIMENTAL DESIGNS

To clarify the principles and observations which have thus far 
been presented in generalized form, it may be well to consider a
number of concrete illustrations. Let us suppose that an experi- 
ment is to be designed to determine the relative effectiveness of 
two methods of teaching spelling to fourth-grade pupils, and let 
us consider specifically the relative merits and limitations of sev- 
eral experimental designs representative of those that have fre- 
quently been employed in educational research. We shall first 
describe briefly the essential feature of each design and then com- 
ment on its merits and limitations. In all illustrations we shall 
assume that the experiment is of the same duration, say 12 weeks, 
that the criterion is a list-dictation type of spelling test based on a
random sampling of the words taught, and that this test is in
all cases administered under the same conditions at the close of the
experiment. It will also be assumed that the crucial comparison in
each case will be a comparison of the mean scores on the criterion
test for the experimental groups. The two methods, which need
not be described, will be referred to as Methods A and B.


Design I: Experiment conducted in a single school enrolling 48 
fourth-grade pupils. These pupils assigned strictly at random to 
two specially constituted classes of 24 pupils each. One class
taught by Method A, the other by Method B, but by different
teachers.

Comments: This design will permit a perfectly valid estimate
of the errors due to pupil-variables, such as differences in initial
spelling ability or differences in learning ability, but only of such
errors. The appropriate error estimate and test of significance
for the difference in means in this case would be that described in
Section 3 of Chapter III (pages 54-55). The precision of the ex-
periment, even with reference only to errors of the first type, will
be quite low. Pupils are so variable in ability to profit by instruc-
tion that random samples of this size are likely to differ widely in
this respect and this difference in ability could alone account en-
tirely for a difference in means much larger than is likely to be pro-
duced by any two methods. Uncontrolled variations outside the
groups, such as teacher differences, are of course entirely ignored
in the estimate of error.
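
The low precision of Design I can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 48 criterion scores are hypothetical (drawn from a normal distribution with mean 50 and standard deviation 10), the two methods are assumed equally effective, and the function name `design_one` is invented for the example. The pupils are repeatedly divided at random into two classes of 24, and the spread of the resulting differences in means is compared with the standard error estimated from one such division.

```python
import math
import random

def design_one(pupil_scores, rng):
    """Randomly split the pupils into two equal classes; return the difference
    in class means and its pooled standard error (df = N - 2)."""
    scores = list(pupil_scores)
    rng.shuffle(scores)
    half = len(scores) // 2
    a, b = scores[:half], scores[half:]
    ma, mb = sum(a) / half, sum(b) / half
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    se = math.sqrt(ss / (len(scores) - 2) * (1 / half + 1 / half))
    return ma - mb, se

rng = random.Random(0)
# Hypothetical scores for 48 pupils; no real difference between methods.
pupils = [rng.gauss(50, 10) for _ in range(48)]

# Differences in means produced by pupil-variables alone.
diffs = [design_one(pupils, rng)[0] for _ in range(2000)]
spread = math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Standard error estimated from a single random division.
_, se_once = design_one(pupils, rng)
print("spread of chance differences: %.2f" % spread)
print("estimated standard error from one division: %.2f" % se_once)
```

Under these assumptions the estimate from a single division agrees closely with the actual spread of chance differences, which is what is meant by a valid estimate of errors of the first type. The spread itself is several score points, which is the sense in which the precision is low.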


Design II: Same as Design I, except that both groups are taught
by the same teacher.

Comments: This design is presented only to illustrate the possi-
bility of reducing errors of the second type. Obviously, however,
even in this case the teacher-variable is not eliminated since the
teacher may strive harder to make one of the methods work, or
may be prejudiced against one, or may be more familiar with one,
etc. The error estimate (as in Design I) would be valid for total
error only if all errors of the second type were completely elim-
inated.


Design III: Experiment conducted in 10 schools. Five schools,
selected at random, use Method A, the other five use Method B.
Results are evaluated by pooling scores on criterion test in a single
distribution for all pupils under each method; by computing the
standard error of the mean of each distribution according to for-
mula (1) (page 12), and from these computing the standard error
of the difference in means. Difference declared “significant” if
three times its standard error.

Comments: This is a design which has actually been employed 
quite often in educational research. The estimate of error is 
invalid, even with reference only to errors of the first type, since 
the samples are not random samples of pupils. An important 
source of error which was not present in Designs I and II, is the 
large systematic difference in achievement and other factors from 
school to school. This source of error may render this design less 
precise than either of the preceding, even though many more 
pupils are involved. A more valid estimate of error could be 
derived, using the methods of Section 8, Chapter III, or others to 
be suggested later, but this of course would not increase the pre-
cision of the design. The estimate which assumes random sam-
pling of pupils seriously underestimates the error and the results
may therefore appear "significant" even though really due to
error.
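
The way in which the pooled estimate understates the error in Design III can be sketched by simulation. All quantities below are assumptions made for illustration: 30 pupils per school, a standard deviation of 5 among school levels and of 10 among pupils within a school, and no real difference between methods. The function names are likewise hypothetical. The sketch compares the average "random sampling of pupils" standard error with the actual spread of the differences in means over many repetitions.

```python
import math
import random

def naive_se(a, b):
    """Standard error of the difference in means, treating every pupil as an
    independent random draw (the assumption that is invalid for this design)."""
    def var_of_mean(x):
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / len(x) ** 2
    return math.sqrt(var_of_mean(a) + var_of_mean(b))

def one_experiment(rng, schools_per_method=5, pupils_per_school=30,
                   school_sd=5.0, pupil_sd=10.0):
    """Simulate Design III once with no real method difference."""
    pooled = []
    for _method in range(2):
        scores = []
        for _school in range(schools_per_method):
            school_level = rng.gauss(50.0, school_sd)
            scores += [rng.gauss(school_level, pupil_sd)
                       for _pupil in range(pupils_per_school)]
        pooled.append(scores)
    a, b = pooled
    diff = sum(a) / len(a) - sum(b) / len(b)
    return diff, naive_se(a, b)

rng = random.Random(1)
runs = [one_experiment(rng) for _ in range(400)]
actual_sd = math.sqrt(sum(d * d for d, _ in runs) / len(runs))
mean_naive = sum(se for _, se in runs) / len(runs)
# Proportion of runs declared "significant" by the three-standard-error rule.
flagged = sum(abs(d) >= 3 * se for d, se in runs) / len(runs)
print("actual spread of differences: %.2f" % actual_sd)
print("average pooled standard error: %.2f" % mean_naive)
print("proportion declared significant: %.2f" % flagged)
```

With school differences of this assumed size, the pooled standard error is far smaller than the actual spread of chance differences, so that many differences due to error alone exceed three times their computed standard error.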


Design IV: Experiment conducted in 10 schools. Procedure in
each school like that of Design II, but the results for all pupils un- 
der each method are pooled in a single distribution and the differ- 
ence in means evaluated as in Design III. 

Comment: This has been one of the most frequently used and 
generally approved designs in educational research. The precision 
of this design may be very much greater than that of Design III,
since school differences tend to be equalized by the device of using
both methods in each school. It would in fact be a very efficient
procedure if a valid estimate of error were employed. The esti-
mate of error which is used assumes random sampling, but these
samples are likely to be more like one another than would random
samples of the same size, since the same schools are equally repre-
sented in each. In this case, therefore, the error estimate exag-
gerates the magnitude of errors of the first type, but very largely
ignores errors of the second type. While the precision of this
design is much greater than that of Design III, the results are less
likely to appear "significant" as judged by the estimate of error,
since the difference in means is likely to be smaller, but the esti-
mate of error will be as large or larger than before (larger because
the variability in scores of pupils from 10 schools is likely to be
greater than of those from 5).

As will be shown later, this experiment should be considered as
consisting of 10 parallel experiments of the nature of Design II,
rather than as a single experiment. If so considered, and if certain
other conditions are satisfied, a much more valid estimate of error
may be derived for the combined experiments.


Design V: Conducted in one school. Pupils are given a prelim-
inary test of spelling ability and two classes are organized such that
they show as nearly as possible the same distribution of scores on
this test. The standard error of the difference in means on the
criterion test is then computed by the special formula for differ-
ences in means of matched samples, which takes into consideration
the correlation between initial and final scores.

Comments: This is the familiar “matched” or “equated”
groups type of experiment. Depending on the correlation be-
tween initial and final scores, this design will result in higher pre-
cision than Design I, and the estimate of error will be highly valid
(recognizing the increased precision) but only so far as errors of the
first type are concerned. A disadvantage of the design is that the
initial test must be administered and scored before the groups can
be organized, and then only with additional administrative diffi-
culties. A procedure will later be suggested which will eliminate
this limitation, but which will yield equal precision and equal
validity of the error-estimate.


Design VI: Duplicates Design V in each of 10 schools. Results
pooled in a single distribution for each method and results evalu-
ated as in single school of Design V.

Comments: This is the most precise of the designs considered, 
but the estimate of error is decidedly invalid since it assumes that 
pupils with the same score on the initial test are assigned at random 
to the two methods. The estimate therefore exaggerates errors of 
the first type, and only indirectly and inadequately considers those 
of the second. 


4. VALID METHODS OF ANALYSIS OF EXPERIMENTAL DATA

The foregoing illustrations should be sufficient to clarify the 
principles earlier suggested and to demonstrate the importance of 
selecting a design that will not only lead to high precision, but will 
also permit a valid estimate of error. It should be noted that one 
of the principal obstacles to the discovery of a satisfactory design 
was the existence of large systematic differences from school to 
school. Were it not for this factor, designs of the type of I, II, and
V preceding could be extended to include enough pupils to secure
any desired degree of reliability. For practical reasons, however,
large samples for educational experiments can be secured only by
combining a number of intact school groups or, otherwise stated,
by duplicating the same experiment in a number of school situa-
tions. When this is done, the total group under any treatment
may no longer be considered as a random sample and the familiar
random sampling techniques applied to the pooled results will no
longer provide a valid estimate of error. What is needed, then, is
some means of collating the results from a series of duplicate ex-
periments in such a way that the controls over “school” variables
of the type described in Designs IV and VI may be taken into 
consideration in the estimate of error. 

The method of analysis which will satisfy this requirement is 
known as the analysis of variance. This method, which has been 
developed in recent years by R. A. Fisher and his students, repre- 
sents one of the most important contributions that has yet been 
made to the techniques of experimental research. It has thus far
scarcely been utilized in educational research, and its possibilities 
are just beginning to be appreciated. It seems destined, however, 
soon to become the standard procedure in experiments of the gen- 
eral type here considered. 

This method will be described in detail in the following chapter, 
but it may be well to draw attention in advance to some of its most 
important features. It is essentially a method of analyzing the 
results from a series of parallel or duplicated experiments, each of 
which is performed under more homogeneous conditions and with 
more homogeneous groups than prevail in the entire population 
involved. The estimate of error which it provides eliminates the 
effects of systematic differences (such as school-differences) from 
one to another of these duplicated experiments. It will also pro- 
vide an estimate of the errors due to factors (other than in the 
treatments) which create systematic differences in the experi- 
mental groups within each duplicated experiment, if these factors 
have been randomized within each experiment. For example, sup- 
pose that different teachers have taught the various classes in each 
school in a design like that of number IV in the illustrations just
considered. If the teachers in each school were randomly assigned
to the methods, the analysis of variance would provide an esti- 
mate of error which allows for this uncontrolled teacher variable. 
The method will thus take into consideration in the estimate of 
error many important uncontrolled variables for which no estimate 
can be derived in a single experiment and will permit the utiliza- 
tion of many types of controls (over errors of the second type) 
which if utilized in a “single experiment" (of the nature of Design 
IV) only make the results appear less conclusive. 

An extension of the analysis of variance, known as the analysis 
of covariance, makes possible all of the precision of designs V and 
VI, without requiring that the pupils in each school be actually
equated with reference to the initial scores. It demands only that 
an initial measure be available for each pupil, but permits random 
division of the pupils into experimental groups in each school. 
This means, of course, that the experimental groups may be or-
ganized without waiting for the administration and scoring of a
preliminary test and eliminates the difficulties which are met in 
attempts to “equate” small groups. 

There are many interesting and important applications of the 
analysis of variance and covariance which cannot be suggested 
here in advance of any detailed consideration of the nature of the 
methods. Perhaps enough has been said, however, to suggest to 
the student that these methods of analysis are deserving of his very 
careful consideration and that it will be worth while for him to 
persist in his study of them until every detail has been thoroughly 
mastered. 


CHAPTER V 
ANALYSIS OF VARIANCE 


1. THE FUNDAMENTAL THEOREM IN ANALYSIS OF VARIANCE

The first step in the development of the methods of analysis
of variance is to demonstrate that the variance of a large sample
consisting of a number of equal groups may be analyzed into two
components: the mean of the variances within the groups, and the
variance of the group means. To demonstrate this fact let us sup-
pose that we have a large sample of N cases which consists of r
groups of n cases each. We shall call these groups group 1, group
2, ..., group i, ..., up to group r, letting i be a general
term representing any group. We shall let M represent the mean
of the large sample, or the general mean of the r groups taken to-
gether, and let $M_i$ represent the mean of a single group. Finally,
we shall let d' represent the deviation from M of a single measure
in group i, and d its deviation from the group mean $M_i$.

It will be observed that this notation is nearly the same as that
employed in Section 1 of Chapter III. In this case, however, M
represents the observed mean of the large sample, or the general
mean of the r groups, rather than a population mean. This, of
course, results in a corresponding difference in the meaning of d'.
Furthermore, we have not here assumed that the r groups are 
random samples from any population. 

Now, in exactly the same manner as in Section 1 of Chapter
III, it may be shown that

$$\frac{1}{r}\sum\frac{\sum d^2}{n} = \frac{\sum d'^2}{N} - \frac{\sum (M - M_i)^2}{r} \tag{12}$$

The steps¹ in the derivation of (12) are algebraically identical with
those in the derivation of (6) on page 49, and (12) differs from (6)
only in the meaning of d' and M. Let us now, by transposition
of terms and change of signs, write (12) as

$$\frac{\sum d'^2}{N} = \frac{1}{r}\sum\frac{\sum d^2}{n} + \frac{\sum (M - M_i)^2}{r} \tag{13}$$

¹ The student should work through this derivation again, bearing in mind through-
out the new meanings of d' and M, to satisfy himself that the change in notation does
not disturb the logic.

We may now note that, since d' represents a deviation from the
general mean and since $\sum d'^2$ represents the summation of the
squares of these deviations over all r groups, the left-hand term in
(13) is simply the variance of the large sample of N cases. The
first right-hand term in (13) is, as before, the mean of the group
variances. The last term is the variance of the group means,
since $(M - M_i)$ is the deviation of a single group mean from the
mean of all group means, $\sum (M - M_i)^2$ is the sum of the squares
of these deviations, and r is their number.

We have thus seen (13) that the variance of a large sample con-
sisting of r smaller groups of n cases each may be analyzed into
two components: the average variance within groups and the va-
riance of the group means. We may remind ourselves that this
is true for any collection of r groups of n cases each, i.e., it does
not involve any assumption of random selection. It does, how-
ever, suggest a test of the hypothesis that the r groups are ran-
dom samples from the same population.
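
The identity in (13) may be checked numerically for any set of equal groups. The sketch below uses three small hypothetical groups of four scores each, and the function name `decompose` is illustrative. All three variances use the divisors N, n, and r, as in (13).

```python
def decompose(groups):
    """Return (variance of the whole sample, mean of the within-group
    variances, variance of the group means) for r groups of n cases each."""
    r = len(groups)
    n = len(groups[0])
    assert all(len(g) == n for g in groups), "groups must be of equal size"
    all_scores = [x for g in groups for x in g]
    N = r * n
    M = sum(all_scores) / N                      # general mean
    total_var = sum((x - M) ** 2 for x in all_scores) / N
    group_means = [sum(g) / n for g in groups]
    within = sum(sum((x - m) ** 2 for x in g) / n
                 for g, m in zip(groups, group_means)) / r
    between = sum((M - m) ** 2 for m in group_means) / r
    return total_var, within, between

# Hypothetical data: r = 3 groups of n = 4 cases each.
groups = [[3, 5, 7, 9], [10, 12, 14, 16], [1, 2, 3, 6]]
total, within, between = decompose(groups)
print(total, within + between)
```

The two printed values agree (apart from rounding in the last decimal places), whatever scores are placed in the groups, since no assumption of random selection is involved.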

For large samples, we are accustomed to using the σ of the sam-
ple as an estimate of the σ of the population (or the variance of the
sample as an estimate of the variance of the population). Ac- 
cordingly, if we had a number of large samples we could, under the 
hypothesis that they were all random samples of the same popula- 
tion, use the average variance of the samples as a better estimate 
of the population variance than the variance of any one sample 
alone. Similarly, again under the hypothesis of random sampling, 
we could use the observed variance of the means of these samples 
as an estimate of the variance of an infinite number of such means. 
But since the variance of a very large number of means of random 
samples of the same size is given by

$$\sigma_M^2 = \frac{\sigma^2}{n},$$

we could also use n times the observed variance of means ($n\sigma_M^2$)
as an estimate of the variance of the population. We would then
have two independent estimates of the variance of the population,
both derived from our set of r samples of n cases each. One would
be based on the variance within groups, the other upon the variance
of group means, or on the variance between groups. If our hy-
pothesis were correct, these two estimates would differ only by
chance. If, then, we could show by means of the F-test that the
ratio between these estimated variances is larger than chance
would allow, we would have reason to believe that our hypothesis
is false.

By way of illustration, suppose we had 40 samples of 75 cases
each, that we did not know that these were random samples from
the same population, but that we wished to test the hypothesis
that they were. To apply the test suggested in the preceding para-
graphs, we would first compute the variance of each sample sepa-
rately, and then find the mean of these 40 variances. The result
would serve as one of our estimates of the population variance.
We would next make up a distribution of the means of these sam-
ples, and compute the variance of this distribution. We would
then multiply this variance by 75 to secure another estimate of
the population variance. We would then compute the ratio (F)
between these two estimates of the population variance. If the
F then proved significant we would have to reject our hypothesis,
i.e., we would say that the variance of the sample means is larger
than chance would allow in random sampling.
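
The steps of this illustration may be sketched as follows. The data are simulated, not taken from the text: the 40 samples of 75 cases are drawn at random from one hypothetical normal population with mean 100 and standard deviation 15, so that the hypothesis under test is in fact true. The variable names are illustrative.

```python
import random

rng = random.Random(2)
r, n = 40, 75
# Hypothetical population: normal, mean 100, sigma 15 (variance 225).
samples = [[rng.gauss(100, 15) for _ in range(n)] for _ in range(r)]

def variance(xs):
    """Variance with divisor equal to the number of cases (large-sample form)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# First estimate: the mean of the 40 sample variances.
within_estimate = sum(variance(s) for s in samples) / r

# Second estimate: 75 times the variance of the 40 sample means.
means = [sum(s) / n for s in samples]
between_estimate = n * variance(means)

F = between_estimate / within_estimate
print("estimate from within samples:  %.1f" % within_estimate)
print("estimate from sample means:    %.1f" % between_estimate)
print("ratio F:                       %.2f" % F)
```

Since the samples here really are random samples from one population, both estimates should fall near the assumed population variance and their ratio should not be far from 1. Whether a given ratio is significant would still be judged from the table for F.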

The test just described would not be exact, since we know that,
particularly for small samples, the σ of the sample is not a good
estimate of the σ of the population, or the sample variance is not
a good estimate of the true variance. Neither, for a small number
of samples, would the observed variance of their means be a good
estimate of the true variance of such means. The preceding
paragraphs, then, only suggest a valid test of our hypothesis.
However, if both n and r are small, we can readily secure similar
but more valid estimates of the true variance by means of formula
(7) (page 50).
According to (7), the best estimate of the true variance which can
be computed from a single small group is $\frac{\sum d^2}{n-1}$. The average of
these estimates for all r groups would of course constitute a still
better estimate of the true variance. This average would be
$\frac{1}{r}\sum\left(\frac{\sum d^2}{n-1}\right)$, which could be more simply written
$\frac{\sum d^2}{r(n-1)}$, in
which $\sum d^2$ represents the grand sum, for all groups, of the squared
deviations of the individual measures, each from the mean of the
group to which it belongs. Similarly, the best estimate of the true
variance of the group means would be $\frac{\sum (M - M_i)^2}{r-1}$, and from
this the best estimate of the variance of the population would be
$\frac{n\sum (M - M_i)^2}{r-1}$.

We now have, under our hypothesis, two estimates of the popu-
lation variance that are valid even though r and n are both small.
These estimates are

$$\text{est'd } \sigma^2 = \frac{n\sum (M - M_i)^2}{r-1} \quad \text{(based on variance of group means or variance between groups)} \tag{14}$$

and

$$\text{est'd } \sigma^2 = \frac{\sum d^2}{r(n-1)} \quad \text{(based on variance within groups)} \tag{15}$$

The number of degrees of freedom for the first estimate is (r − 1),
and for the second is r(n − 1). The second estimate is the average
of the estimates for the individual groups and since each involves
(n − 1) degrees of freedom, the average of r of them will involve
r(n − 1) df.

These estimates will of course differ by chance, even though the 
hypothesis is true. However, they should not differ “signifi- 
cantly.” That is, if the ratio (F) between these estimates proves 
significant, we may conclude that our hypothesis is false, or that
the differences in group means may not be entirely attributed to
fluctuations in random sampling. In other words, in the event of
a significant F, we may conclude that there are some real differences 
between the groups. 
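
The two estimates in (14) and (15) and their ratio may be computed as in the following sketch. The three groups of four scores are hypothetical, and the function name `anova_estimates` is illustrative. Judging the resulting F against a tabled value is left to the reader.

```python
def anova_estimates(groups):
    """Return the between-groups estimate (14), the within-groups estimate (15),
    their ratio F, and the degrees of freedom (r - 1, r(n - 1))."""
    r = len(groups)
    n = len(groups[0])
    assert all(len(g) == n for g in groups), "groups must be of equal size"
    M = sum(x for g in groups for x in g) / (r * n)        # general mean
    group_means = [sum(g) / n for g in groups]
    # (14): n times the sum of squared deviations of group means, over r - 1.
    between = n * sum((M - m) ** 2 for m in group_means) / (r - 1)
    # (15): grand sum of squared deviations from own group mean, over r(n - 1).
    sum_d2 = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    within = sum_d2 / (r * (n - 1))
    return between, within, between / within, (r - 1, r * (n - 1))

# Hypothetical data: r = 3 groups of n = 4 cases each.
groups = [[3, 5, 7, 9], [10, 12, 14, 16], [1, 2, 3, 6]]
between, within, F, df = anova_estimates(groups)
print("between-groups estimate: %.3f" % between)
print("within-groups estimate:  %.3f" % within)
print("F = %.3f with %d and %d degrees of freedom" % (F, df[0], df[1]))
```

For these hypothetical scores the estimate based on group means is many times the estimate based on variance within groups, so that the hypothesis of random sampling from a single population would be rejected at any usual level of significance.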

The foregoing presents the essentials of the logic involved in 
general in the methods of analysis of variance. The basic proposi- 
tion is that from any set of r groups of n cases each, we may, on the 
hypothesis that all groups are random samples from the same popula- 
tion, derive two independent estimates of the population variance, one 
of which is based on the variance of group means, the other on the 
average variance within groups. The test of this hypothesis then
consists of determining whether or not the ratio (F) between these
estimates lies below the value in the table for F that corresponds to the
selected level of significance.

The application of the basic proposition of analysis of variance
in various experimental designs presents many other detailed
problems, and the more important of these subsidiary problems
will be discussed later with reference to concrete illustrations.
Before going on to any consideration of these applications, how-
ever, it is essential that the student arrive at a thorough under-
standing of what has been presented in this section and in Section
1 of Chapter III. There are some things in statistical theory that
the research worker can afford to take on faith, but the proof of
the basic proposition in analysis of variance is not among them.
Unless this proof is fully understood, there is little possibility of
intelligent application of the methods to be described in the fol-
lowing sections.


Note on Computational Procedure: 

We have seen, in (14) and (15), that the terms $\sum d^2$ and
$n \sum (M - M_g)^2$ represent the basic quantities needed for our test
of significance. It may be well to insert here a description of the
manner in which these quantities may be most conveniently computed.
We may note first that

$$\sum d^2 = \sum d'^2 - n \sum (M - M_g)^2. \qquad (16)$$

The proof of this equality was given in Section 1 of Chapter III,
since (16) is identical, except for the meaning of d' and M, with
(5) on page 49.

Now d' represents the deviation of any measure from the general
mean. Hence, if X represents any measure, d' = X − M, and

$$d'^2 = X^2 - 2MX + M^2.$$

Summing these expressions for all N measures, we have

$$\sum d'^2 = \sum X^2 - 2M \sum X + NM^2.$$

But $\sum X = NM$, hence

$$\sum d'^2 = \sum X^2 - 2NM^2 + NM^2 = \sum X^2 - NM^2.$$

It will help avoid confusion later to let GT represent the grand
total of all the measures (GT = NM), and to let GM, instead of
M, represent the general mean. With this notation, we may write

$$\sum d'^2 = \sum X^2 - GT \cdot GM. \qquad (17)$$

Hence, to compute $\sum d'^2$, we need only square each of the N
measures, sum these squares, and subtract the product of the grand
total and the general mean. On an automatic electric computing
machine, the X²'s may be cumulated as they are secured, and the
grand total may be secured along with the sum of the squares.
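
As a quick check on (17), the short Python sketch below compares the sum of squared deviations from the general mean, computed directly, with the shortcut $\sum X^2 - GT \cdot GM$. It is an illustration only; the five scores and the variable names are hypothetical and do not come from the text.

```python
# Hypothetical scores, used only to illustrate equation (17).
scores = [3, 5, 4, 8, 10]

N = len(scores)
GT = sum(scores)          # grand total
GM = GT / N               # general mean

# Direct definition: sum of squared deviations from the general mean.
direct = sum((x - GM) ** 2 for x in scores)

# Shortcut (17): sum of squared scores minus GT * GM.
shortcut = sum(x * x for x in scores) - GT * GM

print(direct, shortcut)
```

Both routes give the same quantity; the shortcut needs only one pass over the scores, which is the point of the computational note.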


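It may be shown similarly * that the term $\sum (M - M_g)^2$ in (16)
is equal to

$$\sum (GM - M_g)^2 = (M_1^2 + M_2^2 + \cdots + M_r^2) - GM \cdot \sum M_g,$$

and that therefore

$$n \sum (GM - M_g)^2 = \frac{T_1^2 + T_2^2 + \cdots + T_r^2}{n} - GT \cdot GM, \qquad (18)$$

in which $T_g$ refers to the sum of the measures for a single group.
The equality in (18) may also be written as follows:

$$\sum n_g (GM - M_g)^2 = (T_1 M_1 + T_2 M_2 + \cdots + T_r M_r) - GT \cdot GM. \qquad (18a)$$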


This form of the equality will be more convenient to use than (18) 
in certain types of situations in which the number of cases varies 
from group to group. 

Expressions (17) and (18) or (18a), as we shall see later, will
markedly simplify the computation of the terms needed in our final
tests of significance. Once $\sum d'^2$ and $n \sum (GM - M_g)^2$ have been
computed by (17) and (18), we may readily secure $\sum d^2$ by subtraction
according to (16).
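
The following Python sketch checks (18) and (18a) against the direct definition, and then secures $\sum d^2$ by subtraction as in (16). It is a sketch under stated assumptions: the function names and the three groups of scores are hypothetical, and the groups are taken to be of equal size so that (18) applies.

```python
def ss_methods_18(totals, n):
    """Equation (18): sum of squared group totals over n, minus GT * GM."""
    GT = sum(totals)
    GM = GT / (n * len(totals))
    return sum(t * t for t in totals) / n - GT * GM


def ss_methods_18a(totals, sizes):
    """Equation (18a): sum of T_g * M_g, minus GT * GM."""
    GT = sum(totals)
    GM = GT / sum(sizes)
    return sum(t * (t / k) for t, k in zip(totals, sizes)) - GT * GM


# Hypothetical scores: r = 3 groups of n = 3 cases each.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
n = 3
totals = [sum(g) for g in groups]
GT = sum(totals)
GM = GT / (n * len(groups))

by_18 = ss_methods_18(totals, n)
by_18a = ss_methods_18a(totals, [n] * len(groups))
direct = n * sum((GM - sum(g) / n) ** 2 for g in groups)

# Total sum of squares by (17), then within groups by subtraction (16).
ss_total = sum(x * x for g in groups for x in g) - GT * GM
ss_within = ss_total - by_18
print(by_18, by_18a, direct, ss_within)
```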


* The full proof of (18) and (18a) is left as an exercise for the student.




2. ANALYSIS INTO TWO COMPONENTS: ANALYSIS OF RESULTS 

IN A SIMPLE METHODS EXPERIMENT 
The methods of analysis of variance will perhaps be most fre- 
quently applied in educational research to analyze the results of 
"methods" experiments. The variations in these methods of 
analysis will therefore be illustrated in terms of such experiments, 
although other possible applications will be suggested later. Our 
first illustration will be of the relatively simple design which may 
be used when the experiment is to be performed in a single school. 
Suppose, then, that in a certain school we have conducted an 
experiment to determine the relative effectiveness of four different 
methods of instruction for a given unit of content. To make the 
arithmetic of the illustration simple, we will say that the experiment
has involved just 20 pupils. These pupils were originally
assigned at random to 4 groups of 5 pupils each, one of which was 
taught by Method A, another by Method B, etc. At the close 
of the experiment, the same criterion test of achievement was 
administered to all pupils. The scores on this criterion test are 

arranged in tabular form as illustrated in Table 6. 


TABLE 6
CRITERION SCORES IN A SIMPLE METHODS EXPERIMENT
(Hypothetical Illustrative Data)

                       Methods (or Groups)
                    A       B       C       D
Totals ($T_g$)      23      36      12      18      89.00 = Grand Total (GT)
Means ($M_g$)       4.6     7.2     2.4     3.6      4.45 = General Mean (GM)


The purpose of our analysis will be to determine whether the
differences in means for the various methods are significant of
real differences, or may be explained away in terms of chance
fluctuations in random sampling. In other words, we wish to
test the hypothesis that the four groups of scores are random
samples from the same population. We must then, according to
the logic of the preceding section, derive from these data two independent
estimates of the variance of this hypothetical population.

The first of these estimates will be based on the group means,
as computed by (14). This means that we must first compute
$n \sum (GM - M_g)^2$, noting that in this case $M_g$ is a general term for
any of the four group means. According to (18),


$$n \sum (GM - M_g)^2 = \frac{T_1^2 + T_2^2 + T_3^2 + T_4^2}{n} - GT \cdot GM$$
$$= \frac{529 + 1296 + 144 + 324}{5} - 89.00 \times 4.45$$
$$= 458.60 - 396.05 = 62.55$$

According to (14), this result, divided by r − 1 = 4 − 1 = 3, will
constitute our first estimated variance. For simplicity, we will
call this result the “variance for methods,” since it is based on the
means of the methods groups.

Our second estimate will be based on the variance within
groups, as computed by (15). This means that we must first compute
$\sum d^2$, which, according to (16), is equal to
$\sum d'^2 - n \sum (GM - M_g)^2$. Now according to (17),

$$\sum d'^2 = \sum X^2 - GT \cdot GM = 497 - 396.05 = 100.95$$

This result was obtained by first computing $\sum X^2$, which is the
sum of the squares of the 20 individual scores. That is,

$$\sum X^2 = \ldots + 4^2 + 2^2 = 497.$$

Now, according to (16),

$$\sum d^2 = 100.95 - 62.55 = 38.40$$

According to (15), this result, divided by r(n − 1) = 4(5 − 1) = 16,
is our second estimated variance. We will call this the “variance
within groups” for simplicity in reference, although strictly it is
an estimate of the hypothetical population variance based on the
variance within groups.
In practical computation, it is convenient to arrange these results
in tabular form, as follows:

                    d.f.    Sum of Squares    Variance
Methods               3          62.55          20.85
Within Groups        16          38.40           2.40
Total                19         100.95


This is the standard form in which we will hereafter array the
results of an analysis of variance. “Sum of squares” is simply a
convenient abbreviation, which really has different meanings for
different rows in the table. The number entered in the first row
in the “sum of squares” column is really $n \sum (GM - M_g)^2$, or n
times the sum of the squared deviations of the methods means
from the general mean. The sum of squares for “within groups”
is the sum of the squared deviations of the individual scores, each
from the mean of the group to which it belongs, or $\sum d^2$. (This
sum of squares was secured by subtraction, but the student may
find it instructive to check the result by finding the difference between
each score and the mean of the group to which it belongs,
squaring these 20 differences, and finding their total.) The sum
of squares for “total” is the sum of the squared deviations of the
individual scores from the general mean, or $\sum d'^2$. It is convenient
to use “sum of squares” to denote briefly all of these
things, but the student should guard against interpreting it
literally as a sum of squared scores or means.

We are now ready to apply the test of significance. The ratio
(F) between the methods and within groups variances is 20.85/2.40
= 8.69. Now we find, entering Table 4 with 3 and 16 d.f.,
that this F is significant at the 1 per cent level; that is, the observed
differences in methods means cannot reasonably be attributed to
chance alone. This does not mean, however, that the differences
in methods means are necessarily due to differences in the relative 
merits of the methods themselves. All that we have done is to 
demonstrate that the differences in means could not reasonably 
be attributed to chance differences in pupil ability resulting from 
the random division of the total sample into the methods groups. 
The observed differences in means, although not due to chance in 
this sense, may be due to some uncontrolled factor, such as the 
teacher-variable, particularly if different teachers had taught the
different groups. The influence of such uncontrolled irrelevant
factors may under certain conditions be taken into consideration
in the test of significance in other types of experimental designs, 
but not in a simple design like that employed in this illustration. 
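
The arithmetic of this illustration can be retraced from the summary figures alone. The Python sketch below is an added illustration, not part of the original text; the variable names are assumptions, while the numbers (the four group totals of Table 6, n = 5, and $\sum X^2$ = 497) are those given in the discussion above.

```python
# Summary figures quoted in the text for Table 6.
totals = [23, 36, 12, 18]   # group totals for Methods A, B, C, D
n = 5                       # pupils per method
sum_x2 = 497                # sum of the squared individual scores

r = len(totals)
N = r * n
GT = sum(totals)            # grand total, 89
GM = GT / N                 # general mean, 4.45

ss_total = sum_x2 - GT * GM                              # (17)
ss_methods = sum(t * t for t in totals) / n - GT * GM    # (18)
ss_within = ss_total - ss_methods                        # (16)

var_methods = ss_methods / (r - 1)        # estimate (14), 3 d.f.
var_within = ss_within / (r * (n - 1))    # estimate (15), 16 d.f.
F = var_methods / var_within

print(round(ss_methods, 2), round(ss_within, 2), round(ss_total, 2))
print(round(var_methods, 2), round(var_within, 2), round(F, 2))
```

The significance of F must still be judged against a table of F for 3 and 16 d.f.; the sketch deliberately stops at the ratio.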

Neither does the F-test signify that all methods differences are 
significant, or that the methods means are ranked in the order of 
the merits of the methods, even disregarding the possible effect 
of uncontrolled irrelevant factors. It may be, for example, that
three of the methods are equally effective, and that the large F 
(or large methods variance) resulted only from the superiority of 
one method. We would still have to make a separate test, then, 
of any particular difference in which we were interested. 

To test the significance of any particular difference, such as the
difference in means for methods A and C, we may apply the t-test
to that difference. This test, as we will recall from Section 2
of Chapter III, is a test of the hypothesis that both groups were
selected at random from the same population, and requires that
we have some estimate of the σ of this hypothetical population.
In formula (10), page 57, this estimate was obtained by pooling
the squared deviations from both groups. In this case, however,
we can get a still better estimate by pooling the sums of squared
deviations from all methods groups. If we assume that the
variance within a group differs from group to group only by chance,
then the “variance within groups” is our best estimate of the population
variance, and its square root is our best estimate of the
$\sigma_{pop}$ needed in the formula for the standard error of a mean.
Hence, the best estimate of the standard error of a single methods
mean is

$$\text{est'd } \sigma_{M_A} = \sqrt{\frac{2.40}{5}} = .693$$


To find the standard error of any other methods mean, we divide
the variance within groups by the number of cases on which the
mean is based, and extract the square root of the result. In this
case, since each methods mean is based on the same number of
cases, each has the same standard error. It should be noted that
this estimate of the standard error of a group mean is based on the
same r(n − 1) degrees of freedom on which the variance within
groups is based.

The estimated standard error of a difference between any two
methods means, such as for A and C, would then be

$$\text{est'd } \sigma_{M_A - M_C} = \sqrt{\sigma^2_{M_A} + \sigma^2_{M_C}} = \sqrt{2\sigma^2_{M_A}} = 1.414\,\sigma_{M_A} = 1.414 \times .693 = .980$$

Again this estimate, being based on the variance within groups,
involves r(n − 1) = 16 d.f. Hence, the value of t for the difference
between two methods means, such as A and C, is

$$t = \frac{M_A - M_C}{\text{est'd } \sigma_{M_A - M_C}} = \frac{4.6 - 2.4}{.98} = 2.24$$

For 16 d.f., this t is significant at the 5 per cent, but not at the 2 per
cent level.

Since all methods means are equally reliable, we may, if we wish,
determine in general what minimum difference between any two
means will be significant at any given level. For example, at the
1 per cent level, t must exceed 2.921 for 16 d.f. Hence, from

$$2.921 = \frac{M_1 - M_2}{.98}$$

we find that the difference must exceed 2.921 × .980 = 2.86.
The differences between the means for A and C, A and D, B and
A, and C and D, are not this large. The mean for B, however,
exceeds those for C and D by more than this amount. Hence,
these are the only differences which are significant at the 1 per cent
level.
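
The individual comparisons above can be sketched in Python as follows. This is an added illustration with assumed variable names; the error variance (2.40), the group size, the four methods means, and the tabled value 2.921 for 16 d.f. are taken from the text. Because the sketch does not round the standard error to .98 before dividing, its t differs slightly in the third decimal from the hand computation.

```python
import itertools
import math

error_variance = 2.40     # variance within groups, 16 d.f.
n = 5                     # cases on which each methods mean is based
means = {"A": 4.6, "B": 7.2, "C": 2.4, "D": 3.6}

se_mean = math.sqrt(error_variance / n)   # standard error of one mean
se_diff = math.sqrt(2) * se_mean          # standard error of a difference

t_AC = (means["A"] - means["C"]) / se_diff

# Minimum difference significant at the 1 per cent level (tabled t = 2.921).
t_crit_01 = 2.921
min_diff = t_crit_01 * se_diff

significant_pairs = [
    (a, b)
    for a, b in itertools.combinations(means, 2)
    if abs(means[a] - means[b]) > min_diff
]
print(round(se_mean, 3), round(se_diff, 3), round(t_AC, 2), round(min_diff, 2))
print(significant_pairs)
```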

The student may well ask why it was necessary to apply the F-test
at all, if we were subsequently going to test the individual differences
anyway. The answer is that the F-test tells us whether
it is worth while to test the individual differences at all. If the
F-test had not proven significant, we would have known at once
that all observed differences in methods means could be due to
chance alone. In that case it would not only have been unnecessary,
but decidedly improper* to apply the t-test to individual
differences.

It may be well to observe here that the F-test which is applied
to the methods and within groups variances may be considered as
essentially a way of applying the t-test to all differences in methods
means simultaneously. If there are only two methods groups,
the F-test becomes the algebraic equivalent of the t-test; in this
case F = t², as the student may note by comparing the values for
F (at the 1 per cent level) in the first column of Table 4 with the
corresponding values for t in the last column of Table 3.
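
The equivalence F = t² for two groups can be verified numerically. The Python sketch below is an added illustration; the two groups of scores and all variable names are hypothetical, and the t used is the pooled-variance t for two independent groups of equal size.

```python
import math

# Hypothetical scores for two methods groups of equal size.
group_a = [1, 2, 3]
group_b = [2, 4, 6]
n = len(group_a)

mean_a = sum(group_a) / n
mean_b = sum(group_b) / n
general_mean = (mean_a + mean_b) / 2

# Variance within groups: pooled squared deviations over 2(n - 1) d.f.
ss_within = (sum((x - mean_a) ** 2 for x in group_a)
             + sum((x - mean_b) ** 2 for x in group_b))
var_within = ss_within / (2 * (n - 1))

# Variance for methods: n * sum of squared deviations of means, 1 d.f.
var_methods = n * ((mean_a - general_mean) ** 2
                   + (mean_b - general_mean) ** 2) / (2 - 1)
F = var_methods / var_within

# t for the difference between the two means, using the same pooled variance.
t = (mean_b - mean_a) / math.sqrt(var_within * (1 / n + 1 / n))

print(F, t ** 2)
```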


* This is particularly true if all of the methods are qualitatively similar, and if the
experiment was not designed for any special comparison of two of the methods. An
exception to this rule might be defended, however, if one of the methods exhibits
marked qualitative differences from all of the others. For example, in an experiment
concerned with the relative effectiveness of five types of review procedures upon retention
of a “lesson” in geography, four of the review procedures consisted of a 30-minute
re-reading of the original lesson. These four procedures differed only in the
distribution of time of re-reading: one 30-minute period, two 15-minute periods, etc.
The fifth review procedure consisted of a 30-minute written exercise in objective test
form. The mean scores on a delayed recall test for the four re-reading groups differed
little among themselves, but the mean criterion score for the objective-drill
group differed appreciably from the others. The F-test involving all five means fell
short of the 5 per cent level of significance. The four re-reading groups were then
combined into a single group, whose mean was compared with that of the objective-drill
group. The t-test showed this difference to be significant beyond the 1 per cent
level. This procedure was legitimate in this case, because the special comparison
was suggested by the qualitative characteristics of the methods, and not just by an
inspection of the final means. Had all five methods been qualitatively similar,
however, it would not have been legitimate to select the method with the highest (or
lowest) criterion mean for a special comparison with one or with a combination of
the others. (See Fisher, Design of Experiments, p. 65.)

It is important to note that the evaluation of individual group
means and of the individual differences between them involves
the assumption that the variance within a group is the same, except
for chance, from group to group. In other words, we assume
that whatever factor has resulted in significant differences in group
means has not also resulted in significant differences in group variances.
If the group variances do differ fundamentally, then, of
course, the standard error of the mean will differ from group to
group even though all groups are of the same size, and it would
not be valid to compute the standard error of a mean from the
“variance within groups.”* This assumption, known as the assumption
of homogeneity of variance, may not be very well satisfied
in the usual methods experiment, particularly when several
schools are involved. It is quite conceivable that the same factor
which causes some schools or methods to produce higher mean
achievement than others may also cause some schools or methods
to produce more variable achievement than others. Evidence will
be presented later in this chapter that the variance in educational
achievement within schools is in fact not constant, but fortunately
it is also possible to show that this fact does not seriously disturb
the test of significance in the type of design most frequently
employed in educational experiments, and which will later be discussed.


* The hypothesis of homogeneous variance may be approximately tested by a
criterion due to M. S. Bartlett, who has shown that

$$\chi^2 = \frac{2.3026}{C}\left(n \log_{10} s^2 - \sum n_i \log_{10} s_i^2\right),$$

in which

$$C = 1 + \frac{1}{3(k - 1)}\left[\sum \frac{1}{n_i} - \frac{1}{n}\right],$$

is approximately distributed as $\chi^2$ with (k − 1) d.f. When all samples are of the
same size, $n_1 = n_2 = \cdots = n_k$, we have

$$C = \frac{3n + k + 1}{3n},$$

3. RULES FOR ANALYZING THE RESULTS OF A SIMPLE METHODS
EXPERIMENT

It may now be well to summarize the procedure for analyzing 
the results of a simple methods experiment in terms of a series of 
brief rules which may conveniently be followed in practice. By a 
"simple methods experiment," we mean one in which the methods 
groups are selected at random from the same large group at the 
beginning of the experiment — а condition which usually can be 
satisfied only when the experiment is performed in a single school 
— and in which the same criterion test is given to all pupils at the 
close of the experiment. To analyze these criterion scores by the 
methods of analysis of variance, the steps in the procedure are 
as follows: 

1. Compute the total (T) and the mean (M) of criterion scores for
each methods group separately.

2. Compute the grand total (GT) and the general mean (GM) for
all scores taken together. (Compute GM to at least five significant
digits.)*

3. Square each individual score, and total these squares for all
groups to secure $\sum X^2$; then subtract GT·GM to secure the
sum of squares for TOTAL ($\sum d'^2 = \sum X^2 - GT \cdot GM$). (Round
this result to five significant digits.) (If the total number of


and hence

$$\chi^2 = \frac{6.9078\,n}{3n + k + 1}\left(n \log_{10} s^2 - n_1 \sum \log_{10} s_i^2\right).$$

For example, in Table 6 on page 93, $n_1 = 5$, $n = 20$, $k = 4$,
$s_1^2 = \frac{13.2}{4} = 3.3$, $s_2^2 = 3.7$, $s_3^2 = 1.3$, $s_4^2 = 1.3$, from which

$$s^2 = \frac{3.3 + 3.7 + 1.3 + 1.3}{4} = 2.4.$$

Hence

$$\chi^2 = \frac{(6.9078)(20)}{65}\left[(20)(0.38021) - 5(0.51851 + 0.56820 + 0.11394 + 0.11394)\right] = 2.19$$

This value of $\chi^2$ is obviously not significant for 3 d.f., hence we are justified in retaining
the hypothesis of homogeneous variance, and on this hypothesis applying the
F-test of significance as on page 95.

* See E. F. Lindquist, A First Course in Statistics, pp. 61-66, for a discussion of
significant digits.

scores is quite large, or if a computing machine is not available,
the most convenient way of securing the sum of squares
for total is to prepare a grouped frequency distribution of all
scores, and to employ the “short” method, using an arbitrary
reference point. The formula needed is

$$\sum d'^2 = i^2\left[\sum fd^2 - \frac{(\sum fd)^2}{N}\right]$$

in which $\sum d'^2$ is the sum of squares for total, i is the size of the
interval, d is the deviation of the midpoint of any interval
from the arbitrary reference point and f the frequency in that
interval, and in which $\sum fd$ is the algebraic sum of the products
fd.)

4. Square the total for each methods group separately, add these
squares, divide their sum by the number of cases in a single
methods group, and subtract GT·GM to secure the sum of
squares for METHODS. (See (18), page 92.) (Carry this result
to the same number of decimal places as in the rounded sum
of squares for total.)

5. Subtract the sum of squares for METHODS from that for TOTAL
to secure the sum of squares for WITHIN GROUPS.

6. Arrange these results in tabular form as indicated below, and
divide the sum of squares for METHODS and WITHIN GROUPS
by the corresponding d.f.'s to secure the corresponding variances.
The d.f. for methods is one less than the number of methods,
and for total is one less than the total number of pupils. The
d.f. for within groups may then be found by subtracting the
d.f. for methods from that for total.

                    d.f.       Sums of Squares    Variances
Methods            (r − 1)      ..........        ..........
Within Groups      (N − r)      ..........        ..........
Total              (N − 1)      ..........

(Note: The within groups variance is often referred to as the
error variance in this design, since the standard errors of the 
means are based on this variance. However, the within 
groups variance is not always used as the error variance, as 
we shall see later in connection with other designs.) 

7. Find the ratio (F) of the METHODS and ERROR variances, and 
turn to Table 4 to see if this ratio is significant. (If the methods 
variance happens to be less than error variance, it is obvious 
that there is no significant difference, and no reference need 
be made to the table.) If the F is not significant, the analysis 
is usually concluded, since there is then usually no need to 
investigate individual differences. (See the footnote on page
98.)

8. If the F is significant, and if more than two methods are involved, 
the individual differences may be evaluated by the t-test. The 
standard error of any mean may be estimated by dividing the 
error variance by the number of cases on which that mean is 
based and extracting the square root of the result. The 
estimated standard error of the difference in any two inde- 
pendent means is the square root of the sum of the squares 
of the two standard errors of the means involved. If all 
methods groups are of the same size, the estimated standard 
error of any difference is 1.414 times the standard error of any 
mean. The number of degrees of freedom for t in any test
involving these estimated standard errors is the same as the
d.f. for the error variance. If all methods groups are of the
same size, the minimum value of any difference which will
be significant at a selected level may be found by finding in
Table 3 the value of t needed at that level, and multiplying
this value of t by the estimated standard error of the
difference.

Note: These rules are applicable even though the number of cases 
may not be the same for all methods groups, so long as these groups 
were originally selected at random from the same population. In 
this case we would compute the sum of squares for methods according
to (18a) on page 92, rather than according to (18), since n
is no longer constant from group to group. According to (18a),
the sum of squares for methods is the sum of the weighted squared 
deviations of the group means from the general mean, the squared 
deviation of each group mean being weighted by the number of 
cases on which it is based. This weighted sum divided by (r— 1) 
is still a valid estimate of the population variance under the hy- 
pothesis of random sampling, and may be used as before to test 
this hypothesis. The sum of squares for total may of course be 
found as before, and the sum of squares within groups is still equal 
to the difference between the sums of squares for total and methods. 
If we can assume homogeneity of group variances, we can still 
compute the standard error of any methods mean by dividing the 
error variance by the number of cases on which that particular 
mean is based, and extracting the square root of the result.
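
The rules above, together with the note on unequal group sizes, can be sketched as a single Python function. This is a hedged illustration rather than a prescribed procedure: the function name `simple_methods_analysis`, the dictionary keys, and the three groups of scores are all hypothetical. The sum of squares for methods is computed by (18a), so the groups need not be of the same size, and the standard error of each mean assumes homogeneity of group variances.

```python
import math


def simple_methods_analysis(groups):
    """Analysis of variance for a simple methods experiment (rules 1 to 8)."""
    sizes = [len(g) for g in groups]
    totals = [sum(g) for g in groups]                  # rule 1: group totals
    means = [t / k for t, k in zip(totals, sizes)]     # rule 1: group means
    N = sum(sizes)
    r = len(groups)
    GT = sum(totals)                                   # rule 2: grand total
    GM = GT / N                                        # rule 2: general mean

    ss_total = sum(x * x for g in groups for x in g) - GT * GM      # rule 3
    ss_methods = sum(t * m for t, m in zip(totals, means)) - GT * GM  # (18a)
    ss_within = ss_total - ss_methods                  # rule 5

    df_methods = r - 1                                 # rule 6
    df_total = N - 1
    df_within = df_total - df_methods
    var_methods = ss_methods / df_methods
    error_variance = ss_within / df_within

    return {
        "ss_methods": ss_methods,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df": (df_methods, df_within, df_total),
        "F": var_methods / error_variance,             # rule 7
        # rule 8: standard error of each mean from the error variance
        "se_means": [math.sqrt(error_variance / k) for k in sizes],
    }


# Hypothetical scores for three methods groups of unequal size.
result = simple_methods_analysis([[1, 3], [2, 4, 6, 8], [5, 5, 8]])
print(result)
```

The F returned here must still be referred to a table of F with the two d.f. values reported, and individual differences would be examined only if that F proves significant.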
We may note, finally, that in any methods experiment of this
type, performed in a single school, any conclusions drawn are
strictly applicable only to that school. In the illustration used
(page 93), Method B seemed superior. We have already observed
that the apparent superiority of Method B may be due only to
the fact that a better teacher was used with Method B than with
the others, or that some other extraneous factor operated systematically
in favor of Method B in the experiment. Let us assume,
however, that all such factors had been very carefully
equalized in the experiment, and that Method B really is superior
in this school. It still does not follow that Method B is the best
of these methods in other schools. The pupils in this particular
school may previously have been taught by a method similar to
Method B, so that they could begin using it at once with full effectiveness,
while the other methods may have been strange to
them, and much of their time during the experiment may have
been spent in becoming acquainted with the method itself. How
effective a particular method may be in a particular school depends
upon the previous experiences of the pupils, or upon study habits
previously acquired by the pupils in that school, and these may
differ systematically from school to school. This particular design,
then, has certain very serious limitations. In the first
place, the error estimate takes into consideration only the first 
of the two sources of error mentioned on page 79 (the pupil vari- 
able), and this may often be the less important of the sources of 
error present. In the second place, the method which is best in 
one school may not be best in another, and hence any recommenda- 
tions concerning methods, which are based on the results of the 
experiment, must be restricted to the particular school involved. 
Finally, the number of pupils available in any one school is usually 
quite small, and hence the degree of precision attained is usually 
low. We shall therefore be particularly interested in the designs 
which avoid these limitations, and which are discussed in the suc- 
ceeding sections. 


4. ANALYSIS INTO THREE COMPONENTS: ANALYSIS OF POOLED 
RESULTS OF DUPLICATED EXPERIMENTS IN RANDOMLY 
SELECTED SCHOOLS (GROUPS OF UNIFORM SIZE) 

In order to attain high precision, to be able to draw conclusions
applicable to schools in general, and to secure a comprehensive 
error estimate, it is necessary in methods experiments to include 
pupils from a number of different schools. The design usually 
employed in such instances is like that of Design IV, described on 
page 82. We shall now consider the appropriate procedure for 
analyzing the results obtained with this type of design. 

This procedure may perhaps be most readily explained with 
reference to a concrete illustration. Suppose, then, that we have 
conducted an experimental comparison of 3 methods involving, 
say, 5 schools, each of which has provided 60 pupils. Suppose 
that, at the beginning of the experiment, we had in each school 
divided the pupils into three classes of 20 pupils each, and had as- 
signed these classes at random one to each of the three methods. 

Our “experiment” would then really consist of five duplicate 
experiments, one in each school. The results in any one school 
could, of course, be analyzed as in the illustration of the preceding
section, but the same procedure could not be applied to the collected
results from all schools. That is, it would not be valid to
throw all results into a single table like Table 6 (page 93), and then 
apply the F-test to the variances for methods and within groups 
in this table. The reason for this is that the procedure of the pre- 
ceding section assumes that the pupils under each method were 
randomly selected from all pupils involved in the experiment, 
whereas a “methods group” thus consisting of intact groups from 
several schools could not be considered as a random sample, even 
though the pupils in each school were randomly selected from the 
available pupils in that school. We must then have some other 
way of “pooling” the results from our five duplicated experiments
that will permit a valid estimate of error for the combined results.

If it were not for the objection already raised, this “pooling”
could be effected by analyzing the variance of the 15 class means
on the criterion test in exactly the same way that we analyzed the
variance of the 20 pupil scores in the example accompanying
Table 6. While this procedure would not be valid, it will be worth
while to consider it more specifically, since through it we may better
arrive at an understanding of a more appropriate procedure.
Let us assume for the moment, then, that we propose to analyze
the variance of the 15 class means by the method of the preceding
section, dealing with class means as we previously dealt with individual
pupil scores. This would involve a tabular arrangement
of the means into three columns of five means each, one column for
each method. It should at once be clear that some of the variance
in the total distribution of 15 class means would be due to systematic
differences in achievement from school to school. These school
differences would make the variance in each methods column (containing
class means from 5 schools) considerably larger than if the
classes were random samples from all pupils in the schools involved.
In other words, the school differences would increase the variance
within groups (in this case groups of means) which would be used
as the error variance in evaluating the methods variance. At the
same time, the design of the experiment would tend to eliminate
the effect of school differences from differences in methods means,
and would thus reduce the magnitude of the methods variance (as 
compared to the situation in which all classes were random samples 
from all pupils in the entire sample of 300). This is because the 
same 5 schools would be represented in each column, and hence 
the superiority or inferiority of any one school would tend to affect 
all methods means alike. 

If we followed the procedure of the preceding section with these 
15 class means, then, we would be eliminating school differences 
from the methods variance without eliminating them from the error 
variance. Hence, while our experiment would really be more precise
than if all pupils were random samples from all pupils involved,
our analysis would make it appear less precise. This, as
we have noted before in connection with Design IV of Chapter IV, 
is an error that has characterized many methods experiments in 
the past. It is clear, then, that we must eliminate school differ- 
ences from our error estimate as well as from our methods differ- 
ences. 

This can be done, without altering our experimental design, by
analyzing the results in another way. It involves analyzing the
total sum of squares for the class means into three components:
the sum of squares between methods, between schools, and a “remainder”
which is left when the sums for methods and schools are
subtracted from the total. As before, the 15 class means would be
arranged in a 5 × 3 table, but in this case we would take care to
have the three class means for each school in the same row. The 
three columns would then represent the methods, and the 5 rows 
the schools. We would find the sum of squares for the methods 
variance exactly as we did in Table 6 (except that we would be 
dealing with class means instead of pupil scores). We would then 
write opposite each row the total and mean for the row (or school), 
and then proceed to find the sum of squares for the rows just 45 
we did for the columns. In other words, we would find the product 
of the total and mean for each row, sum these products, and sub- 
tract the product of the general mean and grand total. The result 
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WO: 
o Live sum of squares for schools. The sum of squares for 
Dui. 1С corresponds to that for total in the example of the 
B v edm would be found as before by squaring each class 
" = as E these squared means, and subtracting the product 
E d А eral mean and grand total. We would then subtract the 
а елее sida for methods and for schools from the sum of squares 
"hes y secure the remainder sum of squares. The d.f. for 
the dj. * uld be one less than the number of schools, or 4, just as 
The FR ^^ methods is one less than the number of methods, or 2. 
^n ud . ed classes would be one less than the number of classes, 
The reais e d.f. for remainder would then be 14 ~ 4 ~ 2 = 8. 
find x qeu for methods, for schools, and for remainder would be 
‘Fauld “al per each sum of squares by its df. The final step 
«ed е test the methods variance by finding its ratio (F) to 
Peen eli nder variance, from which school differences have now 
В minated. 
Before continuing further with a discussion of this procedure, it
may be well to illustrate the arithmetical processes just described.
Let us suppose that the results for our experiment are as summarized
in the following table (Table 7). The scores for the individual
pupils are not given, since they are not essential to an understanding
of the procedure. The student should have no difficulty in
following the computation with the aid of the preceding description.


TABLE 7

ANALYSIS OF VARIANCE OF CLASS MEANS IN A METHODS EXPERIMENT
INVOLVING 5 SCHOOLS AND 3 METHODS

                       Method
                A         B         C       Totals       Means
School 1      20.75     20.00     25.45      66.20      22.0667
School 2      34.60     18.75     29.40      82.75      27.5833
School 3      29.55     24.05     28.05      81.65      27.2167
School 4      39.15     22.65     30.60      92.40      30.8000
School 5      32.40     27.10     28.50      88.00      29.3333

Totals       156.45    112.55    142.00     411.00 = GT
Means         31.29     22.51     28.40                 27.40 = GM

\[ GT \times GM = 11{,}261.40 \]

Total sum of squares (sum of squares for classes)
\[ = 20.75^2 + 34.60^2 + \cdots + 28.50^2 - 11{,}261.40 = 446.88 \]

Sum of squares for methods
\[ = \frac{156.45^2 + 112.55^2 + 142.00^2}{5} - 11{,}261.40 = 200.22 \]

Sum of squares for schools
\[ = \frac{66.20^2 + 82.75^2 + \cdots + 88.00^2}{3} - 11{,}261.40 = 131.43 \]

Sum of squares for remainder
\[ = 446.88 - 200.22 - 131.43 = 115.23 \]


                      d.f.   Sum of Squares   Variance
Methods (M)             2        200.22        100.11
Schools (S)             4        131.43         32.86
Remainder (M X S)       8        115.23         14.40
Classes                14        446.88

F = 100.11/14.40 = 6.95
(d.f. for F = 2 and 8)


(Note: Ordinarily the class means would be carried to more decimal
places, but in this case, since the divisor for each class was 20,
each class mean was even at the second decimal place.)
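
The arithmetic of Table 7 can be retraced mechanically. The following is a minimal sketch in Python, not part of the original text; the function name `three_component_anova` and the result keys are illustrative assumptions, and the class means are those of Table 7.

```python
# Class means from Table 7: one row per school, one column per method (A, B, C).
CLASS_MEANS = [
    [20.75, 20.00, 25.45],
    [34.60, 18.75, 29.40],
    [29.55, 24.05, 28.05],
    [39.15, 22.65, 30.60],
    [32.40, 27.10, 28.50],
]


def three_component_anova(table):
    """Split the sum of squares of a schools-by-methods table of class means.

    Hypothetical helper: returns sums of squares, the two variances
    needed for the test, and their ratio F.
    """
    n_schools, n_methods = len(table), len(table[0])
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / (n_schools * n_methods)  # GT x GM

    ss_classes = sum(x * x for row in table for x in row) - correction
    method_totals = [sum(row[j] for row in table) for j in range(n_methods)]
    school_totals = [sum(row) for row in table]
    ss_methods = sum(t * t for t in method_totals) / n_schools - correction
    ss_schools = sum(t * t for t in school_totals) / n_methods - correction
    ss_remainder = ss_classes - ss_methods - ss_schools  # found by subtraction

    df_methods = n_methods - 1
    df_remainder = df_methods * (n_schools - 1)
    var_methods = ss_methods / df_methods
    var_remainder = ss_remainder / df_remainder
    return {
        "ss_classes": ss_classes,
        "ss_methods": ss_methods,
        "ss_schools": ss_schools,
        "ss_remainder": ss_remainder,
        "var_methods": var_methods,
        "var_remainder": var_remainder,
        "F": var_methods / var_remainder,
    }


result = three_component_anova(CLASS_MEANS)
for name, value in result.items():
    print(f"{name:>14}: {value:8.2f}")
```

The remainder is obtained by subtraction here, exactly as in the hand computation; the sketch assumes one class mean per school-method cell.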

In terms of this illustration, we may now consider more specifically
the nature of the remainder variance that is obtained in an
analysis of this type. As we have seen, it is the part of the total
variance of class means that is left after we have “taken out” the
parts due to methods differences and to school differences. In
other words, it is due to the differences that still exist between class
means after systematic methods and school differences have been
eliminated. It could be computed directly, rather than by subtraction,
as follows: First, “correct” each class mean so as to eliminate
methods differences. Do this by adding to or subtracting
from each class mean in each column the deviation of the mean of
that column from the general mean. For example, in Table 7 the
mean of column A is 3.89 units above the general mean; hence we
subtract 3.89 units from each class mean in column A, thus making
the column mean equal to the general mean. After a similar correction
had been made in all columns, all methods means would
be the same, and all would equal the general mean. Second, again
correct these once corrected class means so as to eliminate school
differences. This is done by adding to or subtracting from each
class mean in each row the deviation of the mean of that row (or
school) from the general mean.


After the first correction, the class means of Table 7 will be as
follows:

                       Method
                  A         B         C      School Means
School 1        16.86     24.89     24.45      22.0667
School 2        30.71     23.64     28.40      27.5833
School 3        25.66     28.94     27.05      27.2167
School 4        35.26     27.54     29.60      30.8000
School 5        28.51     31.99     27.50      29.3333

Methods Means   27.40     27.40     27.40      27.40


After the second correction they will be as follows:

                       Method
                  A           B           C       School Means
School 1        22.1933     30.2233     29.7833      27.40
School 2        30.5267     23.4567     28.2167      27.40
School 3        25.8433     29.1233     27.2333      27.40
School 4        31.8600     24.1400     26.2000      27.40
School 5        26.5767     30.0567     25.5667      27.40

Methods Means   27.40       27.40       27.40        27.40

The student is strongly advised to check for himself the
computations involved in these corrections. After the double
correction, all row means and column means will equal the general mean,
i.e., all systematic school differences and methods differences will 
be eliminated. It is apparent, however, that these corrections do 
not make the class means equal. The corrected class means will 
still differ, and the sum of their squared deviations from the general 
mean is the same as the remainder sum of squares previously com- 
puted by subtraction. Again, the student is strongly advised to 
check this last statement for himself in the example given. In 
actual practice, of course, we never compute the remainder sum 
of squares in this laborious fashion, since, as its name implies, it 
can be so readily obtained by subtraction. However, it helps 
considerably to understand the nature of the remainder variance 
if one actually computes the corresponding sum of squares in this
fashion in a concrete example. 
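
The check recommended above can also be carried out by machine. The sketch below, in Python, is only an illustration and not part of the original text; the variable names are assumptions. It applies the two corrections to the class means of Table 7 and sums the squared deviations of the doubly corrected means from the general mean.

```python
# Class means from Table 7: rows are schools 1-5, columns are methods A, B, C.
CLASS_MEANS = [
    [20.75, 20.00, 25.45],
    [34.60, 18.75, 29.40],
    [29.55, 24.05, 28.05],
    [39.15, 22.65, 30.60],
    [32.40, 27.10, 28.50],
]
n_schools, n_methods = len(CLASS_MEANS), len(CLASS_MEANS[0])
general_mean = sum(sum(row) for row in CLASS_MEANS) / (n_schools * n_methods)

# First correction: remove each column's (method's) deviation from the general mean.
method_means = [
    sum(row[j] for row in CLASS_MEANS) / n_schools for j in range(n_methods)
]
once = [
    [x - (method_means[j] - general_mean) for j, x in enumerate(row)]
    for row in CLASS_MEANS
]

# Second correction: remove each row's (school's) deviation from the general mean.
school_means = [sum(row) / n_methods for row in once]
twice = [
    [x - (school_means[i] - general_mean) for x in row]
    for i, row in enumerate(once)
]

# Sum of squared deviations of the doubly corrected means from the general mean.
ss_remainder_direct = sum((x - general_mean) ** 2 for row in twice for x in row)

for row in twice:
    print("  ".join(f"{x:8.4f}" for x in row))
print(f"remainder sum of squares, computed directly: {ss_remainder_direct:.2f}")
```

The direct result agrees with the remainder found by subtraction, which is the point the paragraph above asks the student to verify.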

It is illuminating to note, incidentally, that the variance for 
schools can be computed from the first of the corrected tables just 
presented (that in which only methods differences have been 
eliminated) by applying the procedure of Section 2 (pp. 93 ff.) to 
this table, but dealing with rows instead of columns. (Again the 
student should check this for himself.) The analysis into three 
components is then really a simple extension of the method of 
analysis into two components. That is, the procedure employed
in Table 7 consists essentially of an analysis of the total variance 
into methods and within methods components by the procedure of 
Section 2, followed by a second application of the same procedure 
to analyze the within methods variance into the schools and re- 
mainder components. 
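
The two-stage view just described can be sketched as follows. This Python fragment is an illustration only; the helper `between_and_within` is a hypothetical name for a "Section 2" style analysis into two components, applied first to the columns of Table 7 and then to the rows of the table from which methods differences have been removed.

```python
# Class means from Table 7: rows are schools 1-5, columns are methods A, B, C.
CLASS_MEANS = [
    [20.75, 20.00, 25.45],
    [34.60, 18.75, 29.40],
    [29.55, 24.05, 28.05],
    [39.15, 22.65, 30.60],
    [32.40, 27.10, 28.50],
]


def between_and_within(groups):
    """Analyze a set of groups into between-groups and within-groups sums of squares."""
    values = [x for group in groups for x in group]
    correction = sum(values) ** 2 / len(values)
    ss_total = sum(x * x for x in values) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    return ss_between, ss_total - ss_between


# Stage 1: the groups are the methods (columns of Table 7).
columns = [list(col) for col in zip(*CLASS_MEANS)]
ss_methods, ss_within_methods = between_and_within(columns)

# Remove methods differences, then treat the schools (rows) as the groups.
general_mean = sum(sum(col) for col in columns) / sum(len(col) for col in columns)
method_means = [sum(col) / len(col) for col in columns]
once_corrected = [
    [x - (method_means[j] - general_mean) for j, x in enumerate(row)]
    for row in CLASS_MEANS
]
ss_schools, ss_remainder = between_and_within(once_corrected)

print(f"methods {ss_methods:.2f}, within methods {ss_within_methods:.2f}")
print(f"schools {ss_schools:.2f}, remainder {ss_remainder:.2f}")
```

In this sketch the within-methods sum of squares from the first stage is exactly what the second stage divides between schools and remainder.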

Let us now consider what it is that causes these doubly corrected 
class means to differ. 

In the first place, such differences might remain because of
chance alone. This is clear with reference to our example if we
visualize what would happen should a homogeneous sample of 300
cases (all of whom had been taught by the same method) be randomly
divided into 15 equal groups, and the means of these groups
arranged in a 5 X 3 table. The row means and column means
would differ as a result of the random assignment of pupils
to the groups, but these differences could be eliminated by
corrections such as those which have been described. These corrections,
however, would only eliminate differences in row means and column
means, and would not eliminate differences in group means
within the same row or within the same column. It would be
interesting to know how much of the remainder variance
may be attributed to chance alone, and a means of so doing will
be given in the next section (Section 5).
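
The thought experiment of the preceding paragraph can be imitated by simulation. The following Python sketch is an illustration under stated assumptions and not part of the original text: the "homogeneous" scores are drawn from a normal distribution with mean 50 and standard deviation 10 (arbitrary choices), the function name is hypothetical, and the remainder variance is multiplied by the class size so that it is expressed on the scale of pupil scores.

```python
import random


def remainder_and_within(rng, n_schools=5, n_methods=3, class_size=20):
    """One chance-only trial: return (remainder variance, within-classes variance)."""
    n_classes = n_schools * n_methods
    # Assumption: a homogeneous population, normal with mean 50 and s.d. 10.
    scores = [rng.gauss(50.0, 10.0) for _ in range(n_classes * class_size)]
    rng.shuffle(scores)  # random division into equal groups
    classes = [
        scores[k * class_size:(k + 1) * class_size] for k in range(n_classes)
    ]
    means = [sum(c) / class_size for c in classes]
    table = [means[i * n_methods:(i + 1) * n_methods] for i in range(n_schools)]

    gm = sum(means) / n_classes
    row_means = [sum(row) / n_methods for row in table]
    col_means = [sum(row[j] for row in table) / n_schools for j in range(n_methods)]

    # Squared deviations of the doubly corrected group means from the general mean.
    ss_rem = sum(
        (table[i][j] - row_means[i] - col_means[j] + gm) ** 2
        for i in range(n_schools)
        for j in range(n_methods)
    )
    # Multiply by class size to express the variance on the pupil-score scale.
    var_rem = class_size * ss_rem / ((n_schools - 1) * (n_methods - 1))

    ss_within = sum((x - m) ** 2 for c, m in zip(classes, means) for x in c)
    var_within = ss_within / (n_classes * (class_size - 1))
    return var_rem, var_within


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
trials = [remainder_and_within(rng) for _ in range(400)]
avg_rem = sum(t[0] for t in trials) / len(trials)
avg_within = sum(t[1] for t in trials) / len(trials)
print(f"average remainder variance:      {avg_rem:.1f}")
print(f"average within-classes variance: {avg_within:.1f}")
```

Under these assumptions the two averages should be close to each other, which illustrates that differences among the corrected group means persist through chance alone.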
In the second place, our corrected means might differ because of
any uncontrolled variables which created real differences in
achievement from class to class in the same school quite apart from the
effects of the methods. In our illustration, for example, the best
teacher in School 1 might have been assigned to Method B and
the poorer teachers to Methods A and C, and this might account at
least in part for the relatively better results for B in School 1 than
in the other schools (see the table of corrected means). Any such
uncontrolled variables would in general tend to increase the
variance in class means, although in some cases they might accidentally
counteract some of the chance differences. A way of dealing with
this contingency will be considered in Section 5 following.

In the third place, the corrected class means might differ because
of real differences in the relative merits of the methods from school
to school. Method B may really be a relatively better method (i.e.,
more nearly equal to A and C) for the pupils in School 1 than for
those in School 2, due to differences in the
experiences of the pupils in these schools. Such differences
might occur, for example, if the relative merits of a method
in any school are dependent upon the methods that have previously
been employed in that school, and if these previous methods
have differed from school to school.

Since the remainder variance may in this type of experiment be
due to any or all of these three factors, it is difficult to give it an
appropriate name. It is really the result of the interaction of all
three factors, and may therefore be referred to in general as an 
"interaction" variance. Since we shall later have occasion to deal 
with other interaction variances, we will refer to it specifically 
as the “methods X schools” variance (read methods by schools), 
or as the interaction of methods and schools, or as the M X S
variance. This notation is quite appropriate, since the number of
degrees of freedom for the M X S variance is equal to the product
of the degrees of freedom for methods and schools. 

We may now appreciate more fully the exact nature of the test 
of significance that we have applied to the methods variance. We 
have just noted that the M x S variance is in part due to chance, 
and in part to any uncontrolled variables which may create real
differences in achievement from class to class in the same school, 
but which cannot be distinguished in that school from real differ- 
ences in the relative merits of the methods. The error estimate (or 
the test of significance) based on the M x S variance thus takes into 
consideration all sources of error that have been randomized within 
each school. It is for this reason that the methods of analysis of 
variance represent so great an improvement over the statistical 
procedures that have heretofore been generally employed in educa- 
tional research. The older procedures, based on the assumption 
of simple random sampling, were not only invalid because they 
were applied to samples that were not random, but they also ig- 
nored all sources of error except the pupil variables. 

The M X S variance, of course, does not take into consideration 
any source of error that has not been randomized, or which operates 
systematically in favor of a certain method in all schools. Con- 
sequently, it is extremely important to exercise great care in in- 
suring that as many as possible of these sources of error are actually 
randomized. This is perhaps best done by first assigning the 
classes to the teachers, classrooms, hours, etc., and then employing 
the “random numbers” procedure (pages 24 ff.) in assigning the 
methods to the classes. If this is done, then all factors (such as 
the teacher-variable) will be randomized which are independent
of the methods themselves. We shall see later that it is not
necessary, except for special purposes, to assign the pupils at random to
the classes, although it is well to do so if that is convenient. 

We have seen that the M X S variance may be due in part to 
real differences in the relative merits of the methods from school 
to school. This possibility is of particular interest with reference 
to the test of significance based on the M X S variance. If there 
are any real differences of this kind, it is particularly appropriate 
to base our test of significance upon these differential or interac- 
tion effects, since if one method is to be recommended for general 
use, it must be best in all schools and not only in some, or at least 
it must be best in most schools. If the schools in the experiment 
are not a random sample of all schools to which we wish to gen- 
eralize, the possibility must also be considered that the total 
superiority of one method in the experiment is due to biased selec- 
tion, that is, to an unintentional selection of schools which hap- 
pened to favor that method. This does not mean that we must 
have a random sample of schools to make a methods experiment 
meaningful, but it does mean that if the sample of schools is not 
random, we must limit our recommendation to schools like those 
that participated in the experiment. 

It may be well to draw specific attention in this example to the
effect upon our error-estimate of “taking out” the school
differences. On page 105 we had suggested, for developmental purposes,
that we analyze the variance of class means by the methods of
Section 2, pp. 93 ff. In other words, we had suggested analyzing
the total sum of squares into its methods and within methods
components, and using the latter as the basis for the error term. Had
we done this, since the sum of squares for within methods is equal
to the sum of squares for schools plus that for M X S, our error
sum of squares would have been 131.43 + 115.23 = 246.66, with
8 + 4 = 12 d.f. Hence, had we not “taken out” schools, our
error variance would have been 246.66/12 = 20.55, whereas with
schools “taken out” it is 14.40. It is clear, then, that our
error-estimate would have been seriously inflated had we not eliminated
from it the effect of the school differences.
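
The effect on the test of significance can be made concrete with a short computation. The sketch below, in Python, is an illustration only; it uses the sums of squares and d.f. quoted in the text, and the variable names are assumptions.

```python
# Sums of squares and d.f. for the class means of Table 7, as quoted in the text.
SS_METHODS, DF_METHODS = 200.22, 2
SS_SCHOOLS, DF_SCHOOLS = 131.43, 4
SS_REMAINDER, DF_REMAINDER = 115.23, 8

var_methods = SS_METHODS / DF_METHODS

# Error term if schools are NOT taken out: the within-methods variance.
var_within_methods = (SS_SCHOOLS + SS_REMAINDER) / (DF_SCHOOLS + DF_REMAINDER)
f_schools_left_in = var_methods / var_within_methods

# Error term with schools taken out: the M X S (remainder) variance.
var_remainder = SS_REMAINDER / DF_REMAINDER
f_schools_taken_out = var_methods / var_remainder

print(f"error variance, schools left in:   {var_within_methods:.2f}")
print(f"error variance, schools taken out: {var_remainder:.2f}")
print(f"F, schools left in:   {f_schools_left_in:.2f} (d.f. 2 and 12)")
print(f"F, schools taken out: {f_schools_taken_out:.2f} (d.f. 2 and 8)")
```

The F-ratio is smaller when school differences are left in the error term, although it is then referred to a table entry with more d.f. for error.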
Reference has been made earlier to the basic assumption of
homogeneity of variance. It must be remembered that in the 
analysis just considered we have been concerned only with class 
means; in the example used, in fact, we knew nothing at all about 
the variability of pupil scores within individual classes. In
applying the F-test based on the ratio of the M and M X S variances,
we must assume only that the class means, after having been “cor- 
rected” to eliminate school differences, are homogeneous in variabil- 
ity from method to method. (The M X S variance is, essentially, 
the variance of class means within methods, school differences 
having been eliminated by the arithmetic of the analysis.) We 
do not have to make any explicit assumption, however, that the 
variance in pupil scores is fundamentally constant from school 
to school. This is an important feature of this design. Evidence 
will be presented later (pp. 132 ff.) that the variance (within schools) 
in educational achievement is in fact heterogeneous, but evidence 
will also be presented (pp. 139 ff.) to show that the variance in class 
means may nevertheless remain approximately constant from 
method to method, and that the test of significance based on the 
ratio of M and M X S variances may remain valid in spite of the 
heterogeneity of pupil variance from school to school. 


5. ANALYSIS INTO FOUR COMPONENTS: ANALYSIS OF POOLED 
RESULTS OF DUPLICATE EXPERIMENTS IN RANDOMLY 
SELECTED SCHOOLS OF UNEQUAL SIZE 

In the analysis of the preceding section we dealt only with class 
means, and gave all classes the same weight. This was quite 
proper when all classes were of the same size, but it would not be 
defensible, except as an approximate procedure demanded by 
special circumstances, if the classes varied in size from school to 
school. We will now consider the procedure which is appropriate 
in the latter situation. 

In order to demonstrate that this procedure is essentially the 
same as that just considered in the preceding section, we shall 
first apply it to the same data used to illustrate that procedure.
The original data from which the class means of Table 7 were
derived were the scores of 300 pupils on the criterion test. We
shall now need the total (T) as well as the mean (M) for each class.
These totals and means are given in Table 8.


TABLE 8

ANALYSIS OF VARIANCE OF PUPIL SCORES IN A METHODS EXPERIMENT
INVOLVING 3 METHODS AND 5 SCHOOLS

(Same data as in Table 7)

                                    Schools
                     1                 2                 3
                  T      M          T      M          T      M
Method A         415   20.75       692   34.60       591   29.55
Method B         400   20.00       375   18.75       481   24.05
Method C         509   25.45       588   29.40       561   28.05
School Totals   1324   22.0667    1655   27.5833    1633   27.2167
and Means

                     4                 5               Methods
                  T      M          T      M        Totals      Means
Method A         783   39.15       648   32.40       3129       31.29
Method B         453   22.65       542   27.10       2251       22.51
Method C         612   30.60       570   28.50       2840       28.40
School Totals   1848   30.8000    1760   29.3333     8220 = GT  27.40 = GM
and Means


The arithmetic of computation is now a combination of that in
Sections 2 and 3 preceding, as follows:

The sum of squares for methods, computed as in Section 2, page
94, is

\[ \frac{3129^2 + 2251^2 + 2840^2}{100} - (8220)(27.40) = 4004.4 \]

The sum of squares for schools, similarly computed, is

\[ \frac{1324^2 + 1655^2 + 1633^2 + 1848^2 + 1760^2}{60} - 225{,}228 = 2628.6 \]

The sum of squares for classes, also similarly computed, is

\[ \frac{415^2 + 400^2 + 509^2 + 692^2 + \cdots + 542^2 + 570^2}{20} - 225{,}228 = 8937.6 \]

We may now see that the sums of squares for methods, schools,
and classes bear the same relation to each other that they did in
Table 7, although each has been made twenty times as large as
before.

The sum of squares for M X S is obtained as before by subtracting
the sums of squares for methods and for schools from the sum
of squares for classes. The result is

\[ 8937.6 - 4004.4 - 2628.6 = 2304.6 \]
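
The computation from class totals may be sketched as follows. This Python fragment is an illustration only, with assumed variable names; the class totals are those of Table 8, and every class is taken to contain 20 pupils as in the example.

```python
# Class totals from Table 8: rows are schools 1-5, columns are methods A, B, C.
CLASS_TOTALS = [
    [415, 400, 509],
    [692, 375, 588],
    [591, 481, 561],
    [783, 453, 612],
    [648, 542, 570],
]
CLASS_SIZE = 20

n_schools, n_methods = len(CLASS_TOTALS), len(CLASS_TOTALS[0])
n_pupils = n_schools * n_methods * CLASS_SIZE
grand_total = sum(sum(row) for row in CLASS_TOTALS)
correction = grand_total ** 2 / n_pupils  # GT x GM

method_totals = [sum(row[j] for row in CLASS_TOTALS) for j in range(n_methods)]
school_totals = [sum(row) for row in CLASS_TOTALS]

# Each squared total is divided by the number of pupils it is based on.
ss_methods = sum(t * t for t in method_totals) / (n_schools * CLASS_SIZE) - correction
ss_schools = sum(t * t for t in school_totals) / (n_methods * CLASS_SIZE) - correction
ss_classes = sum(t * t for row in CLASS_TOTALS for t in row) / CLASS_SIZE - correction
ss_interaction = ss_classes - ss_methods - ss_schools

print(f"methods {ss_methods:.1f}  schools {ss_schools:.1f}")
print(f"classes {ss_classes:.1f}  M X S {ss_interaction:.1f}")
print(f"classes on the class-mean scale: {ss_classes / CLASS_SIZE:.2f}")
```

Dividing any of these sums of squares by the class size recovers the corresponding figure of Table 7, which is the twenty-fold relation noted above.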


With these results, we could proceed at once to compute the 
methods and M X S variances and apply the same test of signifi- 
cance described in the preceding section. However, if the pupils 
in each school had been assigned at random to the classes in each 
school, we might wish to know the variance within classes in order 
to evaluate the M X S variance. We will therefore include the 
computation of the variance within classes in this illustration, al- 
though ordinarily it would not be needed. 

To compute the variance within classes, we must first compute
the sum of squares for total, that is, for all 300 pupil scores. This
is done, as in Section 2, by squaring each of the 300 scores, adding
these squared scores, and subtracting the product GM X GT.
(Without an automatic computing machine, a more convenient
procedure is to construct a frequency distribution of the 300 scores
and find the sum of squares by the short method, as was suggested
on page 101.) The sum of squares for total in this case is 49,764.

The sum of squares for total may be considered, after the manner
of Section 2 of this chapter, as consisting of two components, one
of which is that for differences between classes, and the other for
differences within classes. In other words, we may think of the
variance of the 300 scores as having been analyzed only into the
“between classes” and “within classes” components, just as we
analyzed the variance of the 20 scores in Table 6 into “between
methods” and “within methods” components. Hence, the sum
of squares for within classes is equal to the difference of the total
sum of squares and the sum of squares for classes, as follows:

\[ 49{,}764 - 8937.6 = 40{,}826.4 \]

The d.f.’s for methods (M), schools (S), and M X S are the same
as before. The d.f. for total is one less than the number of pupils,
or 299. The d.f. for within classes is the d.f. for total minus the d.f.
for classes, or 299 - 14 = 285.


We may now arrange our results in tabular form, as follows:

                    d.f.   Sum of Squares   Variance
M                     2        4004.4        2002.2
S                     4        2628.6         657.1
M X S                 8        2304.6         288.1
Within Classes      285       40826.4         143.3
Total               299       49764.0

We now see that the ratio between the M and M X S variances
is exactly the same as in the analysis of Table 7. (This would be
true only when all classes are the same size.) Hence, our test of
significance is the same as before, and all that was said about the
M X S variance in the preceding section still applies here, whether
or not the classes are of equal size. However, we now know the
variance within classes, and if the pupils have been assigned at
random to the classes in each school, this variance may be used to
evaluate the M X S variance. We have noted earlier that the
M X S variance is due in part to the pupil-variable (variance
within classes), in part to uncontrolled variables outside the classes,
and in part to possible real differences in the relative merits of the
methods from school to school. If all extraneous variables were
completely equalized or controlled, and if the relative effects of
the methods are the same in all schools, the M X S variance should
be the same (except by chance) as the variance within classes.
In this particular case, however, we see that the M X S variance
is larger than the within classes variance. The ratio (F) between
these variances is 2.011. For 8 and 285 d.f., an F of 2.58 is
required for significance at the 1 per cent level, and of 1.97 at the
5 per cent level. Hence, the interaction variance may be considered
significantly larger (at the 5 per cent level) than chance would
allow. This indicates
(although not conclusively) either that we have failed to control
the extraneous variables (such as the teacher-variable) or that there
are real differential effects of the methods in different schools, or
both. If we could be certain that the extraneous factors had been 
completely equalized, we could take this as evidence of the presence 
of a real differential effect. In practice, since complete control is 
impossible, we cannot distinguish between the effects of these 
two factors. 
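
The evaluation of the M X S variance against the within classes variance can be sketched as follows. The Python fragment is an illustration only; the sums of squares and d.f. are those of the example, the two critical values of F are simply the figures quoted in the text, and the variable names are assumptions.

```python
# Sums of squares and d.f. from the analysis of the 300 pupil scores.
SUM_OF_SQUARES = {"M": 4004.4, "S": 2628.6, "M X S": 2304.6, "Within classes": 40826.4}
DEGREES_OF_FREEDOM = {"M": 2, "S": 4, "M X S": 8, "Within classes": 285}

variance = {
    name: SUM_OF_SQUARES[name] / DEGREES_OF_FREEDOM[name] for name in SUM_OF_SQUARES
}

# Ratio of the interaction variance to the within-classes variance.
f_interaction = variance["M X S"] / variance["Within classes"]

# Values of F for 8 and 285 d.f. as quoted in the text, not computed here.
F_5_PER_CENT, F_1_PER_CENT = 1.97, 2.58

for name, value in variance.items():
    print(f"{name:>15}: {value:8.1f}")
print(f"F (M X S over within classes) = {f_interaction:.3f}")
print("exceeds the 5 per cent value:", f_interaction > F_5_PER_CENT)
print("exceeds the 1 per cent value:", f_interaction > F_1_PER_CENT)
```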

It may sometimes happen that the M X S variance turns out 
to be smaller than the variance within classes. This is particularly 
likely to happen if only a few schools are involved, since then the 
M X S variance will be based on only a few degrees of freedom, and 
will be quite unstable in relation to the within classes variance. 
If all extraneous factors were perfectly controlled, and if there were 
no real interaction of methods and schools, then, under the hypoth- 
esis of random sampling which we wish to test, the M X S variance 
and the within classes variance are both estimates of the same thing, 
and will differ only by chance. Since the extraneous factors are 
never fully controlled, and since some interaction is likely, we 
would always expect the M x S variance to be the larger. If 
it is less, it is so only by chance, and in this case the within classes 
variance constitutes a better estimate of error than the M X S
variance. The methods variance may then be divided by this 
(within classes) error variance to secure the F for the test of 
significance. 
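
The choice of error term described in this paragraph amounts to a simple rule, sketched below in Python. The function name and its interface are hypothetical; the within classes variance is supplied only when the pupils were assigned at random to the classes, and the second call uses an invented M X S variance merely to show the other branch.

```python
def choose_error_variance(var_interaction, var_within=None):
    """Return (error variance, label) for the F-test of the methods variance.

    var_within is passed only if pupils were randomized within schools;
    it is preferred when the M X S variance happens to fall below it.
    """
    if var_within is not None and var_interaction < var_within:
        return var_within, "within classes"
    return var_interaction, "M X S"


# Example of the text: M X S = 288.1, within classes = 143.3.
print(choose_error_variance(288.1, 143.3))
# Hypothetical case in which the M X S variance is the smaller by chance.
print(choose_error_variance(100.0, 143.3))
```

Whichever variance is chosen, the d.f. used in entering the table of F must be those belonging to that variance.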

It may also quite frequently happen in experiments of this type,
as is true in our example, that the methods variance is significantly
larger than the within classes variance, but not significantly larger
than the M X S variance. This would have the same meaning as
in the analysis of Section 2. That is, it would mean that the
methods differences are almost certainly not due entirely to the
pupil variable, although they might be due to uncontrolled
variables, or to the accidental selection of schools which happened
to favor certain methods. In this situation, if the extraneous
variables had been very effectively equalized, one might safely
conclude that there are real differences between methods so far as 
the particular schools involved are concerned, although we could 
still not safely generalize concerning other schools. In general, 
it is not of very much practical value to know the variance within 
classes, although if the pupils have been randomized it is best to 
compute it in order to cover the contingency that the M X S may 
by chance be too small to provide a good error estimate. Our 
principal interest in the procedure of this section, then, is not that 
it permits us to compute the within classes variance when the pupils 
have been randomized, but that it allows for differences in size of 
class from school to school in the F-test based on the M X S 
variance. 

It may be well to mention here still one other reason why we 
would ordinarily have relatively little interest in the within classes 
variance. The use of this variance as an error term involves the 
assumption of homogeneity of variance within classes from school 
to school. Evidence will be presented in Sections 8 and 9 that 
this assumption will probably not be satisfied in the typical meth- 
ods experiment, and that the validity of the F-test involving the 
within classes variance will be appreciably lowered as a result. 


6. RULES FOR ANALYZING POOLED RESULTS OF DUPLICATED 
EXPERIMENTS IN RANDOMLY SELECTED SCHOOLS 

We shall now apply the procedure of the preceding section in a
concrete illustration involving differences in size of class from
school to school. For the later convenience of the student, we
shall provide, along with this illustration, a set of definite rules
which may be conveniently followed in practical applications. It
will be assumed in these rules that the experiment has been properly
designed and controlled. The essential elements in the design
are that the classes are of equal size in each school (but not
necessarily from school to school), that the classes, teachers, rooms,
etc. have been randomly assigned to the methods, and that com- 
parable criterion measures are secured for all pupils at the close of 
the experiment. These rules are so organized as to cover both the 
situation in which the pupils in each school have been randomly 
assigned to the classes within that school and that in which they 
have not been so assigned. 

The parenthetical remarks following each rule are concerned 
with the concrete example used to illustrate them. This concrete 
example is based on an experiment performed with four methods 
in five schools. There are a total of 440 pupils, 40 in School 1,
120 in School 2, 76 in School 3, 152 in School 4, and 52 in School 5.
The pupils in each school were randomly assigned to four classes 
of equal size, thus resulting in 20 classes in the whole experiment. 
The actual scores of the 440 pupils on the criterion test are not 
given, since these are not essential to an understanding of the pro- 
cedure. 

In this example we shall let T_c represent the sum of the scores
for a single class, T_s the total for a single school, T_m for all pupils
under a single method, and GT the grand total. The numbers
of pupils on which these totals are based will be represented by n_c,
n_s, n_m, and N, respectively.

Step 1: Find the total (T_c) of the criterion scores for the pupils in
        each CLASS separately.

        For the example, these totals are given in the following
        table. For instance, the total for the A-class in School 1
        is 603.

                       Methods
               A      B      C      D       T_s    Sum T_c^2/n_c   T_s^2/n_s
School 1      603    571    558    592     2324      135147.8      135024.4
School 2     1711   1624   1586   1619     6540      356715.1      356430.0
School 3     1120   1082   1006   1083     4291      242634.2      242272.1
School 4     2052   1975   1939   1937     7903      411132.1      410904.0
School 5      741    716    678    700     2835      154724.7      154562.0

T_m          6227   5968   5767   5931    23893 = GT

Methods
Means       56.61  54.25  52.43  53.92    GM X GT = 1297444.2

Step 2: Find the total for each school, for each method, and for the
        entire sample (grand total) by adding class totals. This may
        be done most conveniently if the class totals have been
        arranged in a table like that above. The grand total
        (GT) should equal both the sum of the methods totals and
        of the school totals.

        These totals for the example are given in the preceding
        table.

Step 3: Square each methods total, add, divide the sum by the number
        of pupils under each method, and subtract GT^2/N (= GT X
        GM) to secure the sum of squares for METHODS. (Carry
        the result to at least four significant digits.)

\[ \frac{\sum T_m^2}{n_m} - \frac{GT^2}{N} = \frac{6227^2 + 5968^2 + 5767^2 + 5931^2}{110} - 1297444.2 = 988.6 \]

        In machine computation, the squared totals may be
        cumulated in the lower dial for division by the number of
        cases. Hence, no squared total need be recorded.

Step 4: For each school separately, square the total (T_s) and divide
        by the total number of pupils in the school. (Carry each
        result to the same number of decimal places as in the
        final result of Step 3.)

        For School 1, T_s^2/n_s = 2324^2/40 = 135024.4. The
        results for the other schools are similarly computed, and
        are given in the T_s^2/n_s column of the preceding table.


Step 5: Add the results of Step 4 for all schools and subtract GT^2/N
        to secure the sum of squares for SCHOOLS.

\[ \sum (T_s^2/n_s) - GT^2/N = (135024.4 + \cdots + 154562.0) - 1297444.2 = 1748.3 \]

Step 6: For each school separately, square each class total, add, and
        divide the sum by the number of pupils per class in that
        school. (Carry each result to the same number of decimal
        places as in the final result of Step 3.)

        For School 1,
\[ \sum T_c^2/n_c = \frac{603^2 + 571^2 + 558^2 + 592^2}{10} = 135147.8 \]

Step 7: Add the results of Step 6 for all schools and subtract GT^2/N
        to secure the sum of squares for CLASSES. (Carry to same
        number of decimal places as in result of Step 3.)

\[ \sum \left( \sum T_c^2/n_c \right) - GT^2/N = 135147.8 + \cdots + 154724.7 - 1297444.2 = 2909.7 \]

Step 8: Subtract the sums of squares for METHODS and SCHOOLS
        from that for CLASSES, to secure the sum of squares for
        M X S.

        In the example:
        2909.7 - 1748.3 - 988.6 = 172.8

Step 9: Divide the sum of squares for METHODS by one less than the
        number of methods to get the variance for METHODS. The
        divisor is the d.f. for METHODS.

        988.6/3 = 329.5

Step 10: Divide the sum of squares for M X S by the product of the
        degrees of freedom for METHODS and SCHOOLS to get the
        M X S variance. The divisor is the d.f. for M X S.

        172.8/(4 X 3) = 172.8/12 = 14.40

Step 11: Divide the variance for methods by the variance for M X S.
        The result is the F used in the test of significance.

        F = 329.5/14.40 = 22.88

Step 12: Compare the F just found with values given in Table 4 for
        the corresponding d.f. to determine the significance of the
        methods differences.

        [In the example, the d.f.’s for the F computed are 3 and
        12. In Table 4 we see that for these d.f. an F of 5.74 is
        required for significance at the 1 per cent level. Hence,
        the F is clearly significant.]


Note: This is as far as the computation need be carried 
if the F is not significant, and if the pupils were 
not assigned at random to the classes in each 
school. If the F is significant, and there are 
more than two methods, one would wish to evalu- 
ate the individual differences in methods means, 
as in steps 13 to 15 following. 
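
Steps 1 to 11 can be collected into one routine. The following Python sketch is an illustration under stated assumptions and is not the author's own procedure: the function name and result keys are hypothetical, the class totals and school sizes are those of the example, and classes are assumed to be of equal size within each school, as the rules require.

```python
# Class totals from the example: rows are schools 1-5, columns are methods A-D.
CLASS_TOTALS = [
    [603, 571, 558, 592],
    [1711, 1624, 1586, 1619],
    [1120, 1082, 1006, 1083],
    [2052, 1975, 1939, 1937],
    [741, 716, 678, 700],
]
PUPILS_PER_SCHOOL = [40, 120, 76, 152, 52]


def pooled_analysis(class_totals, pupils_per_school):
    """Follow Steps 2-11 for classes of equal size within each school."""
    n_schools, n_methods = len(class_totals), len(class_totals[0])
    n_pupils = sum(pupils_per_school)

    # Step 2: school totals, methods totals, and grand total.
    school_totals = [sum(row) for row in class_totals]
    method_totals = [sum(row[j] for row in class_totals) for j in range(n_methods)]
    grand_total = sum(school_totals)
    correction = grand_total ** 2 / n_pupils  # GT^2 / N

    # Step 3: sum of squares for methods.
    pupils_per_method = n_pupils / n_methods
    ss_methods = sum(t * t for t in method_totals) / pupils_per_method - correction

    # Steps 4-5: sum of squares for schools.
    ss_schools = (
        sum(t * t / n for t, n in zip(school_totals, pupils_per_school)) - correction
    )

    # Steps 6-7: sum of squares for classes (class size is n / n_methods).
    ss_classes = (
        sum(
            sum(t * t for t in row) / (n / n_methods)
            for row, n in zip(class_totals, pupils_per_school)
        )
        - correction
    )

    # Step 8: M X S by subtraction.
    ss_interaction = ss_classes - ss_methods - ss_schools

    # Steps 9-11: variances and F.
    df_methods = n_methods - 1
    df_interaction = df_methods * (n_schools - 1)
    var_methods = ss_methods / df_methods
    var_interaction = ss_interaction / df_interaction
    return {
        "ss_methods": ss_methods,
        "ss_schools": ss_schools,
        "ss_classes": ss_classes,
        "ss_interaction": ss_interaction,
        "var_methods": var_methods,
        "var_interaction": var_interaction,
        "F": var_methods / var_interaction,
        "df": (df_methods, df_interaction),
    }


analysis = pooled_analysis(CLASS_TOTALS, PUPILS_PER_SCHOOL)
for name, value in analysis.items():
    print(name, value)
```

Because the sketch carries full precision throughout, its figures may differ from the hand computation in the last digit shown; the F obtained is still referred to the table of F with 3 and 12 d.f. as in Step 12.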


Step 13: To find the standard error of a difference between means for
        any two methods, divide the M X S variance by the total
        number of pupils under one method, multiply the quotient
        by 2, and extract the square root of the result.

\[ \sqrt{2\,(14.40/110)} = \sqrt{.2618} = .512 \]

Step 14: Compute the mean for each method.

        [Given in the table.]

Step 15: To find the maximum error at the 1 per cent level in a
        difference in methods means, find the value of t in the last
        column of Table 3 for the d.f. of the M X S variance, and
        multiply the standard error of the difference by this value
        of t.

        [The d.f. for M X S is 12. For this d.f., a t of 3.005
        is required for significance at the 1 per cent level. Hence,
        the maximum error (at 1 per cent level) in the difference
        of two methods means is 3.005 X .512 = 1.54. Hence,
        the mean for method A is significantly higher than for
        any other method, the mean for B is significantly higher
        than for C but not for D, and the mean for D is almost
        significantly higher than for C.]
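
Steps 13 to 15 can be sketched in the same manner. In the Python fragment below, which is an illustration only, the methods totals are the column sums of the class totals of the example, the value of t is the one quoted in the text for 12 d.f., and the variable names are assumptions.

```python
import math
from itertools import combinations

# Methods totals: column sums of the class totals in the example.
METHOD_TOTALS = {"A": 6227, "B": 5968, "C": 5767, "D": 5931}
PUPILS_PER_METHOD = 110
VAR_INTERACTION = 14.40   # M X S variance from Step 10
T_ONE_PER_CENT = 3.005    # value of t quoted in the text for 12 d.f.

# Step 13: standard error of a difference between two methods means.
se_difference = math.sqrt(2 * VAR_INTERACTION / PUPILS_PER_METHOD)

# Step 14: the methods means.
means = {m: total / PUPILS_PER_METHOD for m, total in METHOD_TOTALS.items()}

# Step 15: maximum error at the 1 per cent level, then every pair of methods.
max_error = T_ONE_PER_CENT * se_difference
significant_pairs = []
for first, second in combinations(means, 2):
    difference = means[first] - means[second]
    is_significant = abs(difference) > max_error
    if is_significant:
        significant_pairs.append((first, second))
    print(f"{first} - {second}: {difference:6.2f}  significant: {is_significant}")

print(f"standard error {se_difference:.3f}, maximum error {max_error:.2f}")
```

Each pair is judged by the absolute size of its difference, so the sign printed shows only which of the two means is the higher.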

Note: This is as far as the computation would be carried
      in any case in which the pupils had not been
      randomized in each school (including the case
      of matching within schools). It is apparent,
      then, that in spite of the complexity of the theory,
      the actual computational procedure in an analysis
      of variance is really quite simple. (For example,
      the time required by the writer to complete
      the preceding steps in the illustrative
      problem, using an electrical computing machine,
      was under one hour.) In fact, the time required
      for computation in an analysis of this type is
      less than that formerly required in similar situations
      when the random sampling formulas were
      (incorrectly) used.

      If the pupils have been randomized in each school,
      and one wishes to evaluate the M X S variance, the
      procedure continues as follows:

Step 16: Secure the sum of the squared scores of all pupils on the
        criterion test and subtract GT^2/N to secure the TOTAL sum
        of squares. If an automatic computing machine is not
        available or the number of cases is large, use the “short”
        method with a grouped frequency distribution of the
        scores. (See page 101.)

        In the example, the sum of squares for total was 5891.2.

Step 17: Subtract the sum of squares for CLASSES from the sum of
        squares for TOTAL to secure the sum of squares for WITHIN
        CLASSES.

        5891.2 - 2909.7 = 2981.5

Step 18: Divide the sum of squares for WITHIN CLASSES by the
        number of pupils less the number of classes to secure the variance
        for WITHIN CLASSES. The divisor is the d.f. for WITHIN
        CLASSES.

        2981.5/(440 - 20) = 2981.5/420 = 7.10

Step 19: To evaluate the M X S variance, divide the variance for
        M X S by the variance for WITHIN CLASSES, and evaluate

Step 20: 
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the ¥ as in Step 12, remembering that the D.¥.’s are those 
for m X 5 and WITHIN CLASSES. 
14.40/7.10 = 2.03 =F 


For 12 and 420 d.f. an F of 1.78 is required for signifi- 
cance at the 5 per cent level and of 2.23 at the 1 per cent 
level. Hence, the M x S variance may in this case be 
considered significantly larger than the within classes 
variance. This indicates that the relative effectiveness 
of the methods is not the same from school to school, 
and that the method whict. is best in one school may not 
be best in another. 


If the M X S variance is less than the within classes vari- 

ance, use the within classes variance as the error term in 

testing the methods differences. That is, divide the 

methods variance by the within classes variance, and eval- 

uate the F as in Step 12, taking care to use the d.f.s for 

methods and for the within classes variance. If this test 

proves significant, the individual methods differences 

may be evaluated as in Steps 13 and 15, except that the 

within classes variance (and df.) is substituted for the 
M x S variance (and d.f.). 
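
Steps 16 to 20 reduce to a subtraction, two divisions, and one comparison. The following Python sketch reproduces the figures of the illustrative problem; the function names are hypothetical, and the critical values of F must still be read from Table 4.

```python
def within_classes(total_ss, classes_ss, n_pupils, n_classes):
    """Steps 17 and 18: sum of squares, d.f., and variance within classes."""
    ss = total_ss - classes_ss
    df = n_pupils - n_classes
    return ss, df, ss / df

def error_term(ms_variance, within_variance):
    """Step 20: which variance serves as error in testing methods differences."""
    return "within classes" if ms_variance < within_variance else "M X S"

# Figures from the illustrative problem.
ss, df, within_var = within_classes(5891.2, 2909.7, 440, 20)
f_interaction = 14.40 / within_var   # Step 19: M X S variance over within classes
print(round(ss, 1), df, round(within_var, 2), round(f_interaction, 2))
print(error_term(14.40, within_var))
```

The F found here would then be compared with the tabled values for 12 and 420 d.f., as in Step 19.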

Note: Steps 16 to 20 involve the assumption of homogeneous
variance within schools, an assumption which may be
generally questioned on an a priori basis. (See Section 8,
pp. 132 ff.) Before placing much reliance on the
probabilities read from the F-table for Steps 19 and 20,
therefore, one should make an objective test of the
reasonableness of this assumption. The test needed is
described in the footnote on page 99. To apply this test,
one must first analyze the results for each school
separately (by the method described in Section 3,
pp. 100 ff.) to secure the within classes (within groups)
variance for each school. The test on page 99 may then be
applied to these variances, the within classes variances
from the separate schools being represented by
$s_1^2, s_2^2, \ldots, s_k^2$ in the test, k representing
the number of schools. The corresponding degrees of freedom
for the within classes variances will then be represented
by $n_1$, $n_2$, etc., and $s^2$ will be equal to
$\sum n_i s_i^2 / \sum n_i$. These values may then be
substituted in the expression for $\chi^2$ at the beginning
of the footnote on page 99. If the value of $\chi^2$ is not
significant, one is justified in assuming homogeneous
variance in Steps 16 to 20 (although a non-significant
$\chi^2$ does not prove that the variances are homogeneous).
If the value of $\chi^2$ is very significant, i.e., if the
school variances are very heterogeneous, special methods of
analysis should be employed, but these are beyond the scope
of this book. The student interested in this possibility
should read “The Analysis of Groups of Experiments,” by
F. Yates and W. G. Cochran, Journal of Agricultural
Science, Vol. XXVIII, Part IV, October, 1938 (Cambridge
University Press).
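
As a rough illustration of the kind of test referred to in this note, the sketch below pools the school variances with their degrees of freedom as weights and forms a $\chi^2$ of the Bartlett type. This is an assumption: the expression on page 99 is not reproduced here and may differ in detail (for example, by a small-sample correction factor, which is omitted). The function names and the sample figures are hypothetical.

```python
import math

def pooled_variance(variances, dfs):
    """Weighted mean of the school variances, the weights being the d.f."""
    return sum(n * s2 for n, s2 in zip(dfs, variances)) / sum(dfs)

def homogeneity_chi2(variances, dfs):
    """Uncorrected Bartlett-type statistic, referred to chi-square with k - 1 d.f."""
    s2 = pooled_variance(variances, dfs)
    return sum(dfs) * math.log(s2) - sum(
        n * math.log(v) for n, v in zip(dfs, variances)
    )

# Hypothetical within classes variances for four schools, 20 d.f. each.
school_variances = [6.2, 7.5, 5.9, 9.1]
school_dfs = [20, 20, 20, 20]
print(round(pooled_variance(school_variances, school_dfs), 3))
print(round(homogeneity_chi2(school_variances, school_dfs), 3))
```

When all the school variances are equal the statistic is zero, and it grows as the variances become more unlike.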


The foregoing rules have been 


would not be involved). 


writer to deserve first consideration, since in the actual school situa- 


assign pupils (at random) to special 


recommended does not do justi 
within each school are divided i 
type of experiment is practica 




can be derived from it than is suggested by the preceding rules.
In this type of experiment, the first step (as is suggested in the
preceding note) should be to do an analysis of variance separately
for each school. (This may have some political as well as statisti-
cal value, since each co-operating school may appreciate receiving
an independent report on its own particular experiment.) If the
result of Step 19 then shows the M X S variance to be significant, 
one may want to know why certain schools behave differently 
from others, and the individual school analysis will help to pick 
out the anomalous schools. Unless one method happens to be 
best in all schools, the correct deduction from the experiment may 
be that one method is best for certain types of schools (whose dis- 
tinguishing characteristics may be recognized and specified) and 
another method is best for other types of schools. Whether or 
not it may be possible to thus typify the schools within an experi- 
ment remains to be seen as we accumulate more experience with 
these analytical procedures, but it is a possibility that is worth 
keeping in mind. 


7. MODIFICATIONS OF DESIGN AND ANALYSIS IN METHODS 
EXPERIMENTS INVOLVING SEVERAL SCHOOLS 
We are now ready to consider more fully a fact of considerable
importance in methods experiments of the general type described
in the preceding sections. We assumed in Section 2 (page 93) that
the pupils in the single school involved were randomly assigned to
the experimental classes. This assumption is still essential in the
more complicated design for several schools if the within classes
variance is to be used to evaluate the M X S variance, but it is not
essential if we have no interest in the within classes variance, and
intend to use only the M X S variance as the error term. In the
latter situation, when we use M X S as error, we are considering
the class as the unit of sampling, even though we do weight each
class according to size. Since the class is the unit, the only
assumptions involved are: (1) that the classes are randomly
assigned to the methods, (2) that the variance in “corrected” class
means is fundamentally constant from method to method, (3)
that the distribution of “corrected” class means is fundamentally
normal in form, and (4) that all classes in the same school are of
the same size. Our generalizations then really apply to the
population of schools from which the schools involved in the experiment
were presumably selected at random, rather than to a population
of pupils. If, then, we are not going to use the within classes
variance to test the interaction, we need not consider the random
assignment of pupils to the experimental classes as an essential
element in the design of a methods experiment involving a random
selection of schools. There should be no misunderstanding,
however, as to the importance of randomizing the classes (and the
associated error variables) with reference to methods in each school
separately.

One important implication of the foregoing is that we may, if
we wish, or if no other procedure seems feasible, utilize the classes
as they are found already organized in the schools which are to
participate in the experiment, if these classes are of the same size
(or very nearly so) within each school. In general, this would not
be desirable, although it may often be necessary. The classes as
found are likely to show large systematic differences in
achievement or ability, either because of a deliberate or unconscious
grouping according to ability, or because of systematic differences in
educational experience up to the time of the experiment. The
precision of the experiment will of course depend on the magnitude
of the M X S variance (or upon the variance of class means after
school and methods differences have been eliminated). Hence, the
use of classes as found would result in a relatively large M X S
variance and a relatively low precision, some of which could be
avoided by reorganization of the classes. The purpose of
reorganization is of course to reduce the variance in mean ability from
class to class within the same school. One means of doing this, and in
general a very good way, is to assign the pupils at random to the
experimental classes in each school. A still more effective way of
reducing the M X S variance is to match the classes in each school
separately on the basis of some initial measure of ability. It is
not necessary that this matching be in any sense exact, or that it
must be done on a pupil-to-pupil basis. Any procedure that is
likely to result in a more nearly equal mean achievement of the
classes (within the same school) than would result from random
grouping would be so much to the good. If the matching is to be
done on the basis of an objective measure of ability, such as the
score on an initial achievement test, perhaps the best procedure is
first to assign the pupils “on paper” to tentative experimental
classes at random, then compute the mean of each of these tentative
classes on the initial test, and then to exchange pupils between
classes so as to make the means as nearly alike as possible. If the
correlation between the initial and criterion test is high, this
matching may very markedly increase the precision of the experiment.
It is important that the matching be equally well done in all
schools, otherwise the variance of the class means will be greater
in some schools than others, and hence a basic assumption
will be violated. If the matching is equally well done in all
schools, it is likely that the assumption of homogeneous variance
will be even more nearly satisfied than if random grouping within
schools were employed. In general, then, an objective matching
procedure, which can be made equally effective in all schools, is
better than a subjective or inexact procedure, although any
procedure may be employed that makes the classes more alike in mean
ability than random classes would be.
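
The “on paper” matching procedure just described can be sketched as follows. This is one possible mechanization, not a prescribed routine: pupils are first assigned at random to tentative classes, and pairs of pupils are then exchanged between classes whenever the exchange narrows the range of the class means on the initial test. All names and the scores used are hypothetical.

```python
import random

def random_classes(n_pupils, n_classes, seed=0):
    """Assign pupil indices to tentative classes at random ("on paper")."""
    order = list(range(n_pupils))
    random.Random(seed).shuffle(order)
    return [order[i::n_classes] for i in range(n_classes)]

def mean_spread(scores, classes):
    """Range of the class means on the initial test (smaller is better)."""
    means = [sum(scores[p] for p in c) / len(c) for c in classes]
    return max(means) - min(means)

def match_classes(scores, classes):
    """Exchange pupils between classes while an exchange narrows the spread."""
    classes = [list(c) for c in classes]  # leave the caller's lists untouched
    improved = True
    while improved:
        improved = False
        for a in range(len(classes)):
            for b in range(a + 1, len(classes)):
                for i in range(len(classes[a])):
                    for j in range(len(classes[b])):
                        before = mean_spread(scores, classes)
                        classes[a][i], classes[b][j] = classes[b][j], classes[a][i]
                        if mean_spread(scores, classes) < before - 1e-9:
                            improved = True
                        else:  # undo an exchange that did not help
                            classes[a][i], classes[b][j] = classes[b][j], classes[a][i]
    return classes

rng = random.Random(1)
scores = [rng.randint(20, 80) for _ in range(40)]  # hypothetical initial-test scores
tentative = random_classes(len(scores), 4, seed=2)
matched = match_classes(scores, tentative)
print(mean_spread(scores, tentative), mean_spread(scores, matched))
```

Because exchanges are one-for-one, the classes keep the sizes given them by the random assignment, which preserves the requirement of equal class size within a school.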

We have already noted that if it is necessary to utilize the classes
as found, the method of analysis here described is still valid so far
as the test based on the M X S variance is concerned. This is
fortunate, since sometimes no reorganization of any kind is
practicable. While it is often possible to find schools willing to
co-operate in an experiment to the extent of permitting their regular
classes to be used, relatively few schools are willing to allow any
physical reorganization (of extended duration) of the classes for
experimental purposes. It may be noted, however, that the
necessity of dealing with intact school classes as found is not always a
very serious handicap. If the classes are all in the same building,
if there was no deliberate ability grouping in the original
assignment to classes and if the class organization of the previous year
had not been retained, and if the experiment is performed early
in the year before differences in teachers and other factors had had
time to create large class differences in the same school, the
precision of this type of experiment might be practically as high as if
the pupils were actually assigned at random to the classes at the
beginning of the experiment. However, if the classes in the same
“school,” now meaning school system, were drawn from different
buildings, if these buildings were characterized by marked
differences in environmental and cultural patterns, and if the
experiment were performed late in the school year when teacher
differences had produced still further variability, the precision of the
experiment would be relatively low, perhaps as low as if
“system” differences were not eliminated at all.

When intact classes are used, it will seldom happen that all
classes in the same school are of uniform size,

If the size of class 


totals in place 


f the act pages 152 ff). The analysis will still allow for
systematic differences in class size from school
to school, but ignores such differences within schools. This
procedure, and the test of significance involved, will be only
approximate in character, but in most instances will probably be just as
satisfactory, for all practical purposes, as if the classes in each
school were actually uniform in size. If the classes vary markedly
within schools, for example, if some classes are several times as
large as others in the same school, the validity of this procedure
might be appreciably lowered, but this situation can usually be
avoided in planning the experiment.

The preceding paragraph suggests also what to do about “missing
cases.” Very frequently, although the experiment was
originally planned so as to provide equal classes in each school,
some pupils may drop out of school during the course of the
experiment, or for some other reason fail to take the criterion test.
In this case, if the M X S variance is to be used as the only error
term, the data may be analyzed in exactly the same manner as was
just described.

It may be well, before closing this discussion, to draw specific 
attention to certain possible misconceptions. In the first place, 
there is some danger that the student may conclude, since the 
M X S variance takes into consideration or provides a valid error 
estimate for all uncontrolled but randomized errors, that there is 
no need to control any sources of error that may be randomized. 
We have already seen the advantage of equalizing the pupil
variable by matching the classes. Equally important advantages may
be gained by equalizing any other error-variables. For example,
if the classes within each school are taught by teachers of widely
varying ability, this fact will tend to increase the M X S variance,
and hence lower the precision of the experiment. Randomization
of teachers only insures that the teacher-variable will not operate
systematically in favor of a certain method. Any means of
equalizing the teacher-variable, such as having the same teacher 
teach all experimental classes in the same school, or trying to select 
teachers of equal ability, will tend to reduce the M X S variance 
and increase the precision of the experiment. In designing and 
conducting a methods experiment, then, every practicable precau- 
tion should be taken to render the experimental conditions or ex- 
traneous factors as much alike as possible from class to class within 
the same school. 

A second possible misconception has to do with the significance
of school differences in an analysis of this kind. It may seem that
the worthwhileness of taking schools into consideration at all
depends entirely upon the magnitude of these differences. The sum
of squares for classes is equal to the sum of the sums of squares for
M, S, and M X S. Hence, it is true that “taking out” the sum
of squares for schools reduces the sum of squares for error (M X S),
and thus increases the precision of the experiment, by an amount
dependent upon the magnitude of the variance for schools. (See
page 113.) If the variance for schools is no larger than could be
attributed to chance fluctuations in random sampling (that is, no
larger than the variance within classes), then “taking out” school
differences will not reduce the M X S or error variance. (The sum
of squares for M X S will be reduced, but the d.f. will also be re- 
duced, so that the variance for M X S will remain undiminished, 
or may even be increased if the variance for schools is less than that 
for M X S.) This does not mean, however, that the method of 
analysis here described would not be needed if, in a given experi- 
ment, there were no systematic differences between the partici- 
pating schools (i.e., if the variance for schools were no larger than 
that for within classes). Even though the variance for schools 
were no larger than chance would permit, we would still have to 
employ this method of analysis in order to take into consideration 
in the error (M X S) term any extraneous variables which operated
within each school to create systematic differences from one 
methods group to another. It is quite conceivable that the vari- 
ance for schools could fail to be significant, and that there also may 
be no real interaction of methods and schools, but that we might 
nevertheless have a significant M X S variance due to uncontrolled 
variables such as the teacher variable. Hence, we would still have 
to use M X S as error to take these sources of error into considera- 
tion. Whether or not school differences are significant, the type 
of analysis based on the familiar standard error formulas designed 
for large random samples (see page 83) will not provide a valid 
comprehensive error estimate in an experiment of the type IV de- 
sign involving several schools. 


8. THE ASSUMPTION OF HOMOGENEITY OF VARIANCE 
One of the cardinal sins of educational research workers in the
past has been that of taking too lightly the assumptions underlying
the derivations of statistical techniques. It would be particularly
unfortunate if, in becoming acquainted with a technique which
seems as promising for educational research as that presented in
the preceding section, we should perpetuate this unfortunate
tendency to pass over the basic assumptions without any very
critical consideration of them.
In the method of analysis just considered, one of the most
important underlying assumptions is that of homogeneity of
variance. This assumption entered into the analysis at two
points. In the test of significance based on the ratio of M and
M X S variances, it was assumed that the variance of the
“corrected” class means is fundamentally constant from method to
method. In the test of significance of the interaction, based on the
ratio of the M X S and within classes variances, it was assumed
that the variance of pupil scores, after methods differences have
been eliminated, is the same from school to school. The latter
assumption would also be involved if we were to test the school
differences on the basis of the ratio of the S and within classes
variances, as well as in a test of the M variance against an
error term consisting of within classes, which would be used if the
M X S variance turned out to be less than the variance within
classes.

If either of these assumptions is not satisfied, there is a danger
that the sampling distribution of the observed F’s will not be
described by the table for F, or that the F-tests of significance
based on this table will be invalidated. As was noted on page 99,
there is a reason to suspect each of these assumptions in the typical
methods experiment, particularly the assumption of homogeneity
of variance from school to school. However, if only the latter
assumption is not satisfied, the situation may not be serious so far
as the usefulness of the procedure of the preceding section is
concerned. We have noted (page 114) that the variance of “corrected”
class means might be homogeneous from method to method, even
though the variance within schools is heterogeneous. In other
words, it has been suggested that the test of significance based on
the ratio of the M and M X S variances might be valid even
though the variance within schools is heterogeneous. Whether or
not it actually is valid under these conditions, however, remains
to be proved.

There are, then, two crucial questions that should be answered 
before the procedure of the preceding section may be safely recom- 
mended for wide-spread use in methods experimentation. These 


at least a partial answer to 
that following (Section 9). 
r analyzed the scores made 


› briefly stated, consisted of 
Pupil scores within each school 


S Test D of the 1938 Iowa 
This is a carefully constructed
objective examination of achievement in arithmetic, intended for
use at the sixth- to eighth-grade levels. The test requires 90
minutes for its administration. This test was administered, in the 1938
Iowa Every-Pupil Testing Program, to the seventh-grade pupils
in over 200 schools, in 94 of which 20 or more seventh-graders were
tested. From each of these 94 schools 20 seventh-grade pupils
were selected at random by the method described on page 27.
Only 20 pupils were used from each school in order to simplify
the problem of determining the theoretical distribution of
variances. For the 20 pupils in each school, the variance of the total
scores was computed, and a distribution made of these variances.
This distribution is given in Table 9, in the $f_o$ column. The
computation of the theoretical frequencies ($f_t$) in this table was based
on the fact that for random samples of N cases each the sampling
distribution of the ratio $Ns^2/\sigma^2$, in which $s^2$ is the variance
of a sample ($s^2 = \sum d^2/N$) and $\sigma^2$ is the variance of the
population, is the same as the sampling distribution of $\chi^2$ for
N - 1 degrees of freedom, and that $s^2$ is therefore distributed as
$(\sigma^2/N)\chi^2$. In other words, the theoretical frequencies are
based on the hypothesis that the school variances are homogeneous,
that is, that they differ no more than would the variances of random
samples of 20 cases each. In this case, $\sigma^2$ was estimated by
estimating the true variance, according to (4), from the data for each
school separately, and taking the mean of these estimates. This best
estimate of $\sigma^2$ was 265.54. For 19 d.f., the value of $\chi^2$
which should be exceeded in 20 per cent of all random samples is
23.900. Hence, if we multiply this value by
$\sigma^2/N = 265.54/20 = 13.277$, we should have
(23.900 X 13.277 = 317.32) the value of the variance which should
be exceeded in 20 per cent of our samples. This value, it will be
noted, constitutes the lower limit of the fifth interval (from the
top). The limits of the other intervals were similarly computed
from the $\chi^2$ table, and the observed variances were tabulated
with reference to these intervals. The theoretical relative
frequencies ($f_t$) were then readily derived from the $\chi^2$ table,
and these relative frequencies were then expressed as absolute
frequencies so as to make them comparable.

1 Karl Pearson, “On the Distribution of Standard Deviations of Small
Samples,” Biometrika, Volume 10 (1914-15), pp. 522-529.
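
The computation of an interval limit, and the conversion of relative to absolute theoretical frequencies, may be illustrated as follows. The function names are hypothetical. The percentages assigned to the fourteen intervals (1, 1, 3, 5, 10, 10, 20, 20, 10, 10, 5, 3, 1, 1) are an inference from the theoretical frequencies printed in Table 10, and are consistent with the statement that the lower limit of the fifth interval is exceeded in 20 per cent of samples; they are not stated explicitly in the text.

```python
def variance_limit(chi2_value, sigma2, n):
    """Sample variance exceeded as often as chi2_value: (sigma^2 / N) * chi^2."""
    return chi2_value * sigma2 / n

def absolute_frequencies(percentages, n_schools):
    """Express theoretical relative frequencies (per cent) as numbers of schools."""
    return [p * n_schools / 100 for p in percentages]

# Figures from the text: chi-square of 23.900 for 19 d.f. (20 per cent point),
# estimated population variance 265.54, samples of 20 pupils, 94 schools.
limit = variance_limit(23.900, 265.54, 20)

# Inferred interval percentages, from the top interval to the bottom (assumption).
percentages = [1, 1, 3, 5, 10, 10, 20, 20, 10, 10, 5, 3, 1, 1]
f_t = absolute_frequencies(percentages, 94)
print(round(limit, 2), f_t[4], round(sum(f_t), 2))
```

The remaining interval limits would be found in the same way from the other tabled values of $\chi^2$ for 19 d.f.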


It is at once apparent from Table 9 that the “spread” of the
observed variances is greater than in the theoretical distribution.
The observed frequencies ($f_o$) are systematically smaller than the
theoretical in the middle of the distribution and larger at the
extremes. To determine whether or not this divergence is greater
than can be attributed to chance, the $\chi^2$ test (described on pages
37-40) was applied. The value of $\chi^2$ was 28.978. For 6 d.f. this
value of $\chi^2$ would be exceeded less than 0.1 per cent of the time if
our hypothesis were true; hence, we may very confidently reject
the hypothesis of homogeneous variance.*


TABLE 9
DISTRIBUTION OF VARIANCES OF SCORES OF SEVENTH-GRADE PUPILS ON A
COMPREHENSIVE TEST OF ACHIEVEMENT IN ARITHMETIC (TEST D OF 1938
IOWA EVERY-PUPIL TESTS OF BASIC SKILLS) IN 94 SCHOOLS (20 PUPILS
FROM EACH SCHOOL), COMPARED WITH DISTRIBUTION EXPECTED ON
HYPOTHESIS OF RANDOM SAMPLING

Variances            f_o     f_t     Differences (f_o - f_t)
480.51 and above
447.27 - 480.50
400.22 - 447.26
361.19 - 400.21
317.32 - 361.18
287.97 - 317.31
243.48 - 287.96
203.83 - 243.47
182.11 - 203.82
154.69 - 182.10
134.32 - 154.68
113.75 - 134.31
101.34 - 113.74
101.33 and below

Table 10 presents the results of similar analyses for other
examinations administered in the Iowa Every-Pupil Testing Programs.
The manner in which this table may be read may be illustrated in
the case of Part A of the table. The abbreviated title for Part A
gives the title of the test involved, and indicates that the variance
of pupil scores on this test was computed for 20 randomly selected
seventh-grade pupils in each of 89 schools. Only the lower limits
of the variance intervals are given in the first column (except for
the bottom interval, whose upper limit only is given). Only the
relative observed frequencies are presented, in the $f_o$ column,
opposite the theoretical frequencies ($f_t$) expected on the hypothesis of
random sampling. For the distribution in Part A, the value of
$\chi^2$ computed in the test of goodness of fit would be exceeded by
chance less than ten per cent of the time, i.e., $\chi^2$ was significant
beyond the 10 per cent level. According to the Neyman-Pearson
$\lambda$-test, the probability of exceeding the observed heterogeneity by
chance is approximately the same as the probability that a measure
selected at random from a normal distribution will deviate 4.78
standard deviations from the mean of the distribution. This
probability, according to Table 17 in the Appendix, is less than
.00001.
It may be noted in Table 10 that there is an apparent tendency
toward increased heterogeneity of variance at the higher grade
levels. This is reasonable, since the factors tending toward
heterogeneity have had more time to operate by the time the
pupils reach the upper high-school grades. Part I of the table is of
particular interest, since it is the only instance in which there is
no evidence of heterogeneity. It should be noted, however, that
for a similar test administered in 1937 (Part D of the table), a
marked degree of heterogeneity was found at the same grade level.
These data, then, leave little room for doubt that in the typical
methods experiment we may expect to have some heterogeneity
of variance from school to school. It will therefore be of particular
interest to know the effects of such heterogeneity upon the validity
of the F-test of significance employed in the method of analysis of
Sections 4 to 6 preceding. This will be considered in the following
section.

* A more rigorous test of the hypothesis of homogeneous variance, that
proposed by Neyman and Pearson, was applied to the same data. (“The
Problem of k Samples,” J. Neyman and E. S. Pearson, Bulletin of the Polish
Academy of Science, 1931.) No attempt will be made to explain this test
here. It will be sufficient to say that the test is very much like that
described on page 99, and that according to this test, if the hypothesis of
homogeneous variance were true, the probability of getting a distribution
as heterogeneous as that observed is too small to be computed from
available tables.


TABLE 10

DISTRIBUTIONS OF VARIANCES OF SCORES ON VARIOUS EDUCATIONAL
ACHIEVEMENT TESTS, EACH COMPARED WITH DISTRIBUTION EXPECTED ON
HYPOTHESIS OF RANDOM SAMPLING

A. Vocabulary Test (Part I, Test B, 1938 Iowa Tests of Basic Skills);
   20 Seventh-Grade Pupils in Each of 89 Schools
B. Test of Work-Study Skills (Parts II-VI, Test B, 1938 Iowa Tests of
   Basic Skills); 20 Seventh-Grade Pupils in Each of 89 Schools
C. Test of Language Skills (Test C of 1938 Iowa Tests of Basic Skills);
   20 Seventh-Grade Pupils in Each of 87 Schools

          Part A          |          Part B          |          Part C
Variances   f_o     f_t   | Variances   f_o     f_t  | Variances   f_o     f_t
  105.2-      3    0.89   |   518.0-      1    0.89  |  4324.9-      3    0.87
   97.9-      2    0.89   |   482.1-      1    0.89  |  4025.7-      0    0.87
   87.6-      5    2.67   |   431.4-      6    2.67  |  3602.3-      4    2.61
   79.1-      2    4.45   |   389.4-      5    4.45  |  3251.0-     10    4.35
   69.5-      8    8.90   |   342.1-      7    8.90  |  2856.1-      6    8.70
   63.0-     10    8.90   |   310.4-      6    8.90  |  2591.9-      6    8.70
   53.3-     18   17.80   |   262.5-     18   17.80  |  2191.4-     10   17.40
   44.6-     10   17.80   |   219.7-     13   17.80  |  1834.6-     12   17.40
   39.9-      4    8.90   |   196.3-     12    8.90  |  1639.1-      4    8.70
   33.9-     14    8.90   |   166.8-      8    8.90  |  1392.3-     14    8.70
   29.4-      3    4.45   |   144.8-      5    4.45  |  1209.0-      9    4.35
   24.9-      6    2.67   |   122.6-      1    2.67  |  1023.8-      6    2.61
   22.2-      0    0.89   |   109.2-      4    0.89  |   912.2-      2    0.87
   -22.2      4    0.89   |   -109.2      2    0.89  |   -912.2      1    0.87
   Total     89   89.00   |   Total      89   89.00  |   Total      87   87.00

A. $\chi^2$-test: .05 < P < .10; $\lambda$-test: x = 4.78
B. $\chi^2$-test: .30 < P < .50; $\lambda$-test: x = 2.52
C. $\chi^2$-test: P < .001

D. 1937 Iowa Every-Pupil Test in English Correctness; 30 Ninth-Grade
   Pupils in Each of 107 High Schools
E. 1938 Iowa Every-Pupil Test in English Correctness; 30 Twelfth-Grade
   Pupils in Each of 85 High Schools
F. 1938 Iowa Every-Pupil Test in American History; 30 Pupils in Each of
   99 High Schools

          Part D          |          Part E          |          Part F
Variances   f_o     f_t   | Variances   f_o     f_t  | Variances   f_o     f_t
 2343.4-      3    1.07   |  1047.0-      1    0.85  |   109.2-      9    0.99
 2206.6-      1    1.07   |   985.9-      5    0.85  |   102.9-      2    0.99
 2011.1-      2    3.21   |   898.5-      1    2.55  |    93.8-      5    2.97
 1847.1-      5    5.35   |   825.3-      7    4.25  |    86.1-      3    4.95
 1660.6-      9   10.70   |   741.9-      7    8.50  |    77.4-      5    9.90
 1534.0-     16   10.70   |   685.4-      7    8.50  |    71.5-      7    9.90
 1339.1-     15   21.40   |   598.3-     12   17.00  |    62.4-      8   19.80
 1161.4-     25   21.40   |   518.9-     10   17.00  |    54.1-     11   19.80
 1062.1-      7   10.70   |   474.5-     11    8.50  |    49.5-     13    9.90
  934.2-      9   10.70   |   417.4-     10    8.50  |    43.5-     12    9.90
  836.8-      6    5.35   |   373.0-      9    4.25  |    39.0-      6    4.95
  736.0-      3    3.21   |   328.8-      2    2.55  |    34.3-     11    2.97
  673.7-      2    1.07   |   301.0-      1    0.85  |    31.4-      4    0.99
  -673.7      4    1.07   |   -301.0      2    0.85  |    -31.4      3    0.99
   Total    107  107.00   |   Total      85   85.00  |   Total      99   99.00

D. $\chi^2$-test: .10 < P < .20; $\lambda$-test: x = 2.72
E. $\chi^2$-test: .02 < P < .05
F. $\chi^2$-test: P < .001; $\lambda$-test: x = 3.10

G. 1938 Iowa Every-Pupil Test in Ninth-Year Algebra; 30 Ninth-Grade
   Pupils in Each of 94 High Schools
H. 1938 Iowa Every-Pupil Test in Biology; 30 Tenth-Grade Pupils in Each
   of 64 High Schools
I. 1938 Iowa Every-Pupil Test in English Correctness; 30 Ninth-Grade
   Pupils in Each of 122 High Schools

          Part G          |          Part H          |          Part I
Variances   f_o     f_t   | Variances   f_o     f_t  | Variances   f_o     f_t
   90.2-      7    0.94   |   156.7-      2    0.64  |  1320.0-      3    1.22
   85.0-      1    0.94   |   147.5-      0    0.64  |  1242.9-      1    1.22
   77.4-      3    2.82   |   134.4-      3    1.92  |  1132.9-      4    3.66
   71.1-      3    4.70   |   123.5-      5    3.20  |  1040.5-      5    6.10
   63.9-      6    9.40   |   111.0-      3    6.40  |   935.4-     12   12.20
   59.1-      9    9.40   |   102.6-      4    6.40  |   864.1-     11   12.20
   51.6-     12   18.80   |    89.5-     13   12.80  |   754.3-     26   24.40
   44.7-     17   18.80   |    77.6-     14   12.80  |   654.2-     20   24.40
   40.9-      5    9.40   |    71.0-      5    6.40  |   598.3-     13   12.20
   36.0-     14    9.40   |    62.5-      8    6.40  |   526.2-     14   12.20
   32.2-      7    4.70   |    55.9-      4    3.20  |   471.4-      8    6.10
   28.3-      2    2.82   |    49.2-      1    1.92  |   414.6-      3    3.66
   25.9-      5    0.94   |    45.0-      1    0.64  |   379.5-      2    1.22
   -25.9      3    0.94   |    -45.0      1    0.64  |   -379.5      0    1.22
   Total     94   94.00   |   Total      64   64.00  |   Total     122  122.00

G. $\chi^2$-test: .01 < P < .02; $\lambda$-test: x = 5.52
H. $\chi^2$-test: .30 < P < .50; $\lambda$-test: x = 1.82
I. $\chi^2$-test: .95 < P < .98


9. THE EFFECT OF HETEROGENEOUS VARIANCE WITHIN SCHOOLS
UPON THE F-TEST OF SIGNIFICANCE IN METHODS EXPERIMENTS

To secure some quantitative description of the effect of
heterogeneous variance within schools upon the validity of the tests of
significance employed in methods experiments of the type
described in Section 4 of this chapter, the writer and Mr. R. H.
Godard determined the actual distribution of F’s for a very large
number of experiments of this type, and compared the actual
F-distribution with that which theoretically should be obtained if
all assumptions were satisfied." 

The first step in this empirical study was similar to the procedure 
described in the preceding section. The basic data were the scores 
made on Test A of the 1938 Iowa Every-Pupil Tests of Basic Skills 
by the sixth-grade pupils in 151 Iowa schools. This test was a 
55-minute objective test of silent reading comprehension. Fifteen 
pupils were selected at random from each school, and the variance 
of the scores of these 15 pupils was computed for each school sep- 
arately. The distribution of these variances showed some hetero- 
geneity, but not as pronounced as in some of the distributions pre- 
sented in the preceding section. Since a markedly heterogeneous 
distribution was desired, 47 schools of near-average variance were 
discarded. The distribution of variances for the remaining 104
schools is given in Table 11. The theoretical relative frequencies
(f_t) are expressed as absolute frequencies in the f_t column, and the
corresponding observed frequencies are given in the f_o column.
The differences in the last column indicate that the observed
distribution contains many more very large and very small variances
than would be found in random samples of 15 cases each.
This degree of heterogeneity is perhaps more pronounced than
would be found in most school subjects (compare with the
distributions in Table 10).

The next step was to divide the 15 pupils in each school into 
three groups of 5 pupils each. This division was made at random 
by the method described on page 28. One of these groups was 
then considered, for the purposes of this study, as having been 
taught by Method A, another by Method B, and the third by 
Method C, although actually, of course, all had been taught alike 
in each school. 


* The results of this study are reported in complete detail in an
unpublished Master's thesis, “The Effect of Heterogeneous Variance Upon
Certain F-Tests of Significance,” by R. H. Godard. State University of
Iowa, M. A. Thesis in Education, 1939.


TABLE 11
DISTRIBUTIONS OF OBSERVED AND THEORETICAL VARIANCES OF SCORES
OF SIXTH-GRADE PUPILS ON A TEST OF SILENT READING COMPREHENSION
IN 104 IOWA SCHOOLS (15 PUPILS FROM EACH SCHOOL)

Variances              [f_o, f_t, and Differences columns not legible]

328.108 and above
302.572 – 328.107
266.677 – 302.571
237.167 – 266.676
204.368 – 237.166
182.649 – 204.367
150.188 – 182.648
121.837 – 150.187
106.592 – 121.836
 87.710 – 106.591
 73.985 –  87.709
 60.440 –  73.984
 52.468 –  60.439
below 52.468



The next step was to divide the 104 schools into 26 random sets
of 4 schools each, again by the procedure described on page 28.
New “random numbers” were then employed, and 26 different
random combinations or sets of 4 schools each were secured. This
was repeated until 1000 random sets of 4 schools each had been
selected from the 104 schools. It is highly improbable that any
one of these sets contained the same four schools as any other.
The data for each of these 1000 sets of 4 schools then constituted
the basis for an analysis of variance in an “experiment” of the
type described in Sections 5 and 6 of this chapter. Each of the
1000 “experiments” involved four schools and three “methods,”
with a total of 60 pupils, 20 under each “method.” For each
experiment the variance for methods (M), for M X S, and for
within classes was computed. For each “experiment,” also,
ratios (F) between M and M X S, M and within classes, and
M X S and within classes were computed. It was thus possible
to make up a distribution of 1000 observed F's for each of these
three ratios.
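
The procedure just described can be imitated on a small scale with artificial data. The sketch below is only an illustration, not a reconstruction of the original study: the pupil scores are simulated from normal distributions, the three standard deviations (6, 12, and 18) are hypothetical values chosen to make the schools heterogeneous, and the function names `mean_squares` and `simulate` are invented for this example. The 5 per cent point of 5.14 for 2 and 6 d.f. is the value quoted in the text.

```python
import random


def mean_squares(scores):
    """Variances (mean squares) for a methods-by-schools layout.

    scores[s][m] is the list of pupil scores for school s under "method" m.
    All subgroups are assumed to be of equal size.
    """
    n_schools, n_methods, n = len(scores), len(scores[0]), len(scores[0][0])
    total_n = n_schools * n_methods * n
    all_scores = [x for school in scores for group in school for x in group]
    correction = sum(all_scores) ** 2 / total_n

    ss_total = sum(x * x for x in all_scores) - correction
    ss_classes = sum(sum(g) ** 2 for school in scores for g in school) / n - correction
    method_sums = [sum(sum(school[m]) for school in scores) for m in range(n_methods)]
    ss_methods = sum(t * t for t in method_sums) / (n_schools * n) - correction
    school_sums = [sum(sum(g) for g in school) for school in scores]
    ss_schools = sum(t * t for t in school_sums) / (n_methods * n) - correction
    ss_mxs = ss_classes - ss_methods - ss_schools   # remainder = interaction
    ss_within = ss_total - ss_classes

    return {
        "M": ss_methods / (n_methods - 1),
        "MxS": ss_mxs / ((n_methods - 1) * (n_schools - 1)),
        "within": ss_within / (total_n - n_schools * n_methods),
    }


def simulate(n_experiments=1000, n_schools=104, seed=0):
    """Proportion of M / (M X S) ratios exceeding 5.14 when no method effect exists."""
    rng = random.Random(seed)
    schools = []
    for _ in range(n_schools):
        sigma = rng.choice([6.0, 12.0, 18.0])        # hypothetical heterogeneity
        pupils = [rng.gauss(50.0, sigma) for _ in range(15)]
        rng.shuffle(pupils)                          # random division into 3 groups of 5
        schools.append([pupils[0:5], pupils[5:10], pupils[10:15]])

    exceed = done = 0
    while done < n_experiments:
        order = list(range(n_schools))
        rng.shuffle(order)                           # a fresh split into sets of 4 schools
        for i in range(0, n_schools - 3, 4):
            if done == n_experiments:
                break
            ms = mean_squares([schools[j] for j in order[i:i + 4]])
            if ms["M"] / ms["MxS"] > 5.14:
                exceed += 1
            done += 1
    return exceed / n_experiments
```

Calling `simulate()` returns the observed proportion of F's beyond the 5 per cent point, which may be compared with the theoretical .05; the exact figure depends on the simulated data and the seed.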

We shall first consider the distribution of observed F’s for the 
M and M X S variances, in which ratio we are, of course, most 
interested. In each “experiment,” the d.f. for methods was 2 and
for M X S was 6. Hence, according to Table 4, if all assumptions
underlying the F-test were satisfied, we should expect 5 per
cent of the F's to exceed 5.14 if there were no real methods
differences. In this case there can be no real methods differences,
since there are no real methods. Each of the so-called “methods”
groups was selected at random from pupils that had been taught
alike, and any differences in “methods” means could be due only
to chance. Hence, if the known heterogeneity of variance did not
disturb the F-test, we should expect 5 per cent of our 1000 F's, or
50 F's, to exceed 5.14. Actually, 64 of the observed F's exceeded
this value. Similarly, 1 per cent or 10 of our observed F's should
exceed 10.92, whereas 15 actually exceeded this value. The 20
per cent point in the F-distribution, according to Fisher and
Yates' Statistical Tables, is 2.13 for 2 and 6 d.f. Hence we should
expect 200 of our F's to exceed 2.13, whereas 215 actually exceeded
this value. These data are summarized in the table below.


F's (M/M X S)        Observed    Theoretical

10.92 and above          15           10
 5.14 – 10.91            49           40
 2.13 –  5.13           151          150
below 2.13              785          800


If we compute χ² for these observed and theoretical frequencies,
we find a value of χ² = 4.813. For three degrees of freedom, this
value of χ² would be exceeded almost 20 per cent of the time under
a true hypothesis. Hence, according to this test, this divergence
of observed from theoretical frequencies is no larger than would be
found about once in five in studies like this, even though the
underlying assumptions were exactly satisfied in each. A more
adequate or more rigid test than the χ² test would undoubtedly show
that this divergence could not so readily be attributed to chance,
but even so, the absolute divergence in this case is clearly not
large. These data, so far as they go, are quite in agreement with
the suggestion made earlier that real differences in variance from
school to school will not seriously affect the validity of the test
of significance of methods differences based on the ratio of the M
and M X S variance.
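
The χ² value quoted above can be checked directly. In the sketch below, the four class frequencies are derived from the counts given in the text (15, 64, and 215 of the 1000 observed F's exceeding 10.92, 5.14, and 2.13, against 10, 50, and 200 expected). The function names are invented for this illustration, and the tail probability uses the closed form that holds for exactly three degrees of freedom.

```python
import math


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_tail_3df(x):
    """Probability that chi-square with exactly 3 d.f. exceeds x."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Class frequencies obtained from the cumulative counts in the text.
observed = [15, 64 - 15, 215 - 64, 1000 - 215]
expected = [10, 50 - 10, 200 - 50, 1000 - 200]

chi2 = chi_square(observed, expected)
p = chi_square_tail_3df(chi2)
print(round(chi2, 3), round(p, 3))
```

The resulting probability is a little under .20, in line with the statement that such a value would be exceeded almost 20 per cent of the time.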

The findings for the other two ratios are of a somewhat different 
character. For the ratio of the M X S and within classes vari- 
ances, the distributions of observed and theoretical F's are as 
follows (6 and 48 d.f.): 


F's (M X S / within classes)


3.20 and above
2.30 – 3.19
1.41 – 2.29
below 1.41


For the ratio of the M and within classes variances, the distribu- 
tions are (2 and 48 d.f.): 


F's (M/within classes) 


5.08 and above 


3.19 — 5.07 
1.67 — 3.18 
below 1.67 


In each case, it may be shown by the χ² test that the divergence
of observed from theoretical frequencies is significant far beyond
the 1 per cent level. There is little question, then, that
heterogeneity of variance within schools will result in a considerably
larger proportion of high F's for these tests of significance than
the proportions given in the table for F. Even so, the divergence
of observed from theoretical frequencies is not large enough to
render the F-test completely meaningless for these ratios. Even
with the very marked degree of heterogeneity produced in the
data used in this study, one would not go far wrong, for most
practical purposes, in using the table for F to evaluate the
significance of the F obtained for these variances. It should be
observed, however, that these data are by no means conclusive. It
might be that in experiments involving more schools, and hence
with larger d.f. for M X S and within classes, the divergence
would be less marked, or it may be that the validity of the
F-test would be even more seriously affected. However, the
tentative conclusion seems justified that in the typical methods
experiment one may safely employ the F-test as an approximate
test of the M X S/within classes ratio, particularly if one
insists on a high level of significance before generalizing from the
results obtained.

It is very important to observe that these data, even though we 
assume that they are generally representative, do not completely 
establish the validity of the F-tests in experiments of the type 
with which we are here concerned. In this study, only the effect 
of real differences in variability of pupil scores from school to 
school was considered. Nothing has been shown about the effect 
of possible differences in variability from method to method. If it
should happen that certain methods produce greater variability
than others, either in pupil scores within the same school, or in
“corrected” class means from method to method, there is again
the danger that the F-test will be seriously invalidated. This
latter possibility seems relatively remote, since the differences
found between mean scores on the criterion test in most methods
experiments have been quite small in relation to the variabilities
of the distributions. It seems unlikely, therefore, that
methods incapable of producing large differences in central tendency
would produce sufficiently large differences in variability to
disturb the F-tests seriously.


10. GENERAL POSSIBILITIES OF ANALYSIS OF VARIANCE
IN EDUCATIONAL RESEARCH

The preceding illustrations have dealt only with “methods” 
and “schools,” since it is perhaps in the instructional experiment 
that analysis of variance will find its most important single type of 
application in educational research. There are many other types 
of situations, however, in which the methods of analysis of vari- 
ance may be used to great advantage by the research worker in 
education. A few of these situations will be suggested in the 
following discussion; the purpose of the discussion, however, is 
not to exhaust these possibilities, but to stimulate the student to
discover many other possible applications for himself with the 
aid of the suggestions offered. 

The simplest of the methods thus far discussed is that involving
only one classification (such as methods), and in which the
hypothesis to be tested is that the differences in means among the
various groups in this classification are due only to chance fluctuations
in random sampling. This method of analysis may be applied
either to experimental or observational data. (Observational
data are those obtained from investigations of existing populations,
i.e., not derived from experiments.)

One illustration of this method of analysis as applied to experimental
data was given in Section 2 (pages 93 ff.), the purpose of
the experiment being to evaluate the effects upon achievement of
certain methods of instruction. The same general procedure
could be followed in any experiment of the same general design.
For example, it could be used in experiments intended to discover
the relative effects of various sizes of type upon reading rate, or of
different amounts (or distributions) of drill upon retention, or of
various types of propaganda upon attitude toward a social issue,
or of different environmental conditions upon measured intelligence,
or of different diets upon body weight, different lighting conditions
upon eye-strain, different drugs upon sense discrimination, etc.
In other words, the term “treatments” may include any series of
variations in any factor which may influence any type of measurable
performance or any trait. The term “individuals,” similarly,
is obviously not restricted to school pupils, but may include
such units as libraries in an experimental study of use of
library facilities, or schoolrooms in a study of ventilating conditions,
or rats in an animal-learning experiment, etc. With
these suggestions, then, the student should be able to supply
any number of further illustrations of the use of simple
analysis of variance in experimental situations in his own field
of interest.

The same method of analysis is appropriate for observational 
data when each of the groups compared may be considered as a 
random sample of a specified type or class of individuals. For 
example, one might evaluate differences in the mean achievement 
in high-school physics of pupils grouped according to amount of 
previous training in mathematics. Again, one might select 
samples of individuals from different vocational groups and evalu- 
ate differences in the mean intelligence of the samples. Further 
examples would be found in investigations purporting to evaluate 
differences in: mean expenditure per pupil in schools of various 
types; average grade-points earned by college freshmen coming 
from different high schools; mean achievement of pupils with 
different vocational interests; mean life of brooms supplied by 
different manufacturers; mean tenure of teachers with different 
amounts of professional training; mean amounts of training for 
teachers in different counties or states; mean persistence in school 
of pupils from families in different income brackets; etc. It is in 
situations like these that the feature noted on page 102 — that the 
method is applicable whether or not the groups are equal in size — 
is of most value. 

The second general type of design, in order of complexity, is 
that in which there are two classifications, but in which it is desired 
to evaluate (i.e., determine the significance of) differences in group 
means only within one of these classifications. The second or
cross-classification is introduced merely in order to increase the
precision of the major comparisons or to permit a valid estimate
of error. In other words, there are in this design only two major
sources of variation (aside from chance), and only one of these is
to be evaluated; the other is to be equalized in the major
comparisons and its effects eliminated from the error estimate. In
the illustrations already given of this design (Sections 4 to 6 of this
chapter), these major sources of variation were methods and schools,
methods being the factor to be evaluated and schools that to be
equalized. In more general terms, this design involves a number
of homogeneous groups (of which schools is only one example),
each of which is divided into a number of equal or proportional ¹
subgroups with reference to the major classification. In agricultural
research, in which it was first widely used, this design is
known as a “randomized blocks” design; hence, in educational
research, it seems appropriate to refer to it as a “randomized
groups” design.

The randomized groups design, as here defined, is in general
appropriate only in experimental situations. It consists essentially
of a number of duplicate experiments, each of which has been
performed independently with a relatively homogeneous group.
The accompanying analysis is therefore essentially a means of
evaluating the pooled results for a number of duplicate experiments.
In educational research, each of the homogeneous groups
would most frequently consist of pupils under the same teacher,
or in the same school, or in the same community, and hence the
basic experiment would be duplicated in different classes, or in
different schools, or in different communities. In such cases, the
experimenter finds the groups already organized for him, and
since they must be left intact, he has no choice but to take group
differences into consideration in his analysis if he is to have a valid
estimate of error. Quite often, however, the homogeneous
groups are organized by the experimenter, especially for the purposes
of the experiment, to secure increased precision. For example,
he might organize the pupils into groups of the same
age, or same level of intelligence, or same level of achievement,
and in effect duplicate the basic experiment with each level
or group. In other words, the various “treatment” classes
would be “equated” with reference to age, or intelligence, or
achievement.

¹ The interaction or “remainder” [...] (see page 106) only if the
sizes of the subgroups withi[n ...] in the same proportion for all
homogeneous groups. [...] then only a special case of proportional
subgroups.

It may be well to consider in some detail the procedure to be 
followed when the homogeneous groups are organized by the 
experimenter, rather than simply identified and used by him. Suppose,
for example, that the object of an experiment is to determine
the effect of color of light upon the accuracy with which a worker 
performs a certain task. Our plan is to have one set of workers 
perform the task with a blue light, another set with a yellow light, 
etc., and to compare measures of accuracy for the various light 
conditions. Suppose 5 light conditions or “treatments” are to be 
tested. If we are to use the randomized groups design, the homo- 
geneous groups would have to be specially organized for the pur- 
poses of the experiment. In selecting or organizing these groups, 
our object would be to select the workers in each group so that 
their criterion measures would be as nearly alike as possible. We 
must then group them on the basis of some measure which is as 
highly related as possible to the criterion of accuracy. The most 
obvious way of doing this would be to test first the accuracy of 
each individual when all are working under the same light con- 
ditions. We would then rank all workers in order of these initial 
measures, and let the first five constitute group or “level” 1, the 
second five level 2, etc. We would then assign the workers in 
each level one to each of the five treatments at random. If the
number of levels is not too small, the workers under the various 
treatments would then be closely “equated” with reference to our 
initial measure (although not on a man-to-man basis), and hence 
the precision of the treatment comparisons would be increased. 
The analysis of the criterion measures would then be made in 
the manner of Section 4, pages 104 ff. (with only one worker in each
subgroup). Assuming 20 “levels” or 100 workers, we should have

                                   d.f.

Treatments                           4
Levels                              19
Treatments X Levels (Error)         76
Total                               99


By substituting other “treatments” for “color of light,” and levels 
of intelligence, or of initial achievement, or of chronological age, 
etc., for “levels of accuracy,” the student should be able to supply 
many other illustrations of possible applications of this procedure. 
It may be noted, however, that the methods of analysis of covari- 
ance (Chapter 6) will suggest a way of securing equal precision in 
an experiment of this kind without having to go to the trouble of 
arranging the individuals into “levels” in advance. 
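
The assignment procedure described above (rank on the initial measure, cut the ranking into levels, then assign the members of each level to treatments at random) can be sketched as follows. The function name `assign_by_levels`, the worker labels, and all colors other than blue and yellow are hypothetical; the sketch assumes the number of workers is an exact multiple of the number of treatments.

```python
import random


def assign_by_levels(initial_scores, treatments, seed=None):
    """Return {worker: (level, treatment)} for a randomized groups design.

    initial_scores maps each worker to an initial measure. Workers are
    ranked from highest to lowest, cut into consecutive levels whose size
    equals the number of treatments, and each level's members are assigned
    one to each treatment in random order.
    """
    k = len(treatments)
    if len(initial_scores) % k != 0:
        raise ValueError("number of workers must be a multiple of the number of treatments")
    rng = random.Random(seed)
    ranked = sorted(initial_scores, key=initial_scores.get, reverse=True)

    assignment = {}
    for level, start in enumerate(range(0, len(ranked), k), start=1):
        block = ranked[start:start + k]
        order = list(treatments)
        rng.shuffle(order)                 # independent randomization within each level
        for worker, treatment in zip(block, order):
            assignment[worker] = (level, treatment)
    return assignment


# Hypothetical usage: 100 workers, 5 light conditions, hence 20 levels.
workers = {f"worker_{i}": random.Random(i).gauss(70, 10) for i in range(100)}
plan = assign_by_levels(workers, ["blue", "yellow", "red", "green", "white"], seed=42)
```

Because each level contributes exactly one worker to every treatment, the treatment groups are balanced on the initial measure without being matched man for man.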

It is not necessary to have the same number of individuals in 
each “level” as there are treatments to be evaluated. For example, 
in an experiment involving 210 pupils, we might divide the pupils 
into just ten levels, say on the basis of scores on a general intelli- 
gence test, 21 pupils at each level. Suppose that three “treatments”
are involved, and that the pupils in each level are divided
at random into three subgroups of 7 pupils each, one subgroup for
each method. We would then have

                                   d.f.

Treatments                           2
Levels                               9
Treatments X Levels                 18
Within subgroups                   180
Total                              209
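
The degrees of freedom in both layouts follow from the same counting rule, sketched below. The function name and dictionary keys are chosen for this illustration only; equal subgroup sizes are assumed.

```python
def degrees_of_freedom(n_treatments, n_levels, per_subgroup=1):
    """Partition of the total d.f. in a treatments-by-levels design
    with per_subgroup individuals in every treatment-level subgroup."""
    df = {
        "treatments": n_treatments - 1,
        "levels": n_levels - 1,
        "treatments_x_levels": (n_treatments - 1) * (n_levels - 1),
        "within_subgroups": n_treatments * n_levels * (per_subgroup - 1),
    }
    df["total"] = n_treatments * n_levels * per_subgroup - 1
    # The four components must account for the whole of the total d.f.
    assert sum(v for k, v in df.items() if k != "total") == df["total"]
    return df


light_experiment = degrees_of_freedom(5, 20)          # 100 workers, one per subgroup
pupil_experiment = degrees_of_freedom(3, 10, 7)       # 210 pupils, 7 per subgroup
```

With one individual per subgroup there is no within-subgroups term, which is why the treatments X levels interaction serves as the error term in the first layout.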


One feature of this design deserves special consideration.
Ordinarily, it is not appropriate to use an interaction variance as an
error variance unless one of the effects involved in the interaction
is a random effect (in the M X S variance of Sections 4 to 6,
“schools” is a random effect, i.e., the schools were selected at
random from a population of schools). In the designs just
considered, however, we may justify the use of the treatments X levels
interaction as an error term in a special sense. The justification
of this procedure is given in the following paragraph.

Suppose, first, that the treatments X levels variance is significantly 
larger than the within-subgroups variance. This suggests that the 
relative effectiveness of the treatments depends upon the level of 
intelligence of the pupils to which they are applied. Hence, even 
though the treatments variance proved significantly larger than the 
interaction variance, we could not necessarily recommend one 
treatment for use at all individual levels. We could, however, 
recommend it for use with any other sample made up of the same 
levels of intelligence as the sample used in the experiment, i.e., 
with any sample drawn at random from the same population as 
our sample. The levels of intelligence which we have used do not,
it is true, constitute a sample, random or otherwise, from any
large number of such levels. On the contrary, we have all levels
represented in our sample that are represented in the population,
and the fact that the treatment variance significantly exceeds the
interaction variance is evidence that certain treatments work
better than others at most levels, even though they do not for all.
Of course, if there is no significant interaction, or if the interaction
variance is less than the within-subgroups variance, we would use
the latter as error.

In general, then, when the “homogeneous groups” are organized 
especially by the experimenter by dividing the individuals in a 
sample into “levels” with reference to some continuous variable, 
so that all possible levels are represented, the interaction variance 
may be used as an error term for the purpose of generalizing about 
the population from which the total sample was drawn. It should 
be noted, also, that while this design may be duplicated in a number
of randomly selected schools, we would not in that case introduce
“levels” into our analysis. As was noted on page 129, the
M X S variance is still valid as error even though the groups have
been matched within each school.

The student may have observed that when the homogeneous 
groups are already organized and must be left intact (as when they 
are school classes), it is not always feasible to divide each of them 
at random into subgroups which are equal or proportional in size 
and which may be physically separated for experimental purposes. 
This is a serious obstacle in some types of research, but it must not 
be overlooked that many of the "treatments" we may wish to 
compare may be simultaneously administered to all of the sub- 
groups in each of the homogeneous groups. For example, we may 
wish to determine the relative effects upon reading rates of four 
sizes of type, and may have to work with a sample of several school 
classes, each of which must be left intact. To do this, we might 
prepare four editions of the same rate-of-reading test, each in one 
of these sizes of type. We could then make up a pile of tests for 
each class, each pile containing an equal number of each of the 
four editions. We would then randomize the papers in each pile 
separately. For each class, we could then hand out the tests (in 
the random order in which they appear in the pile) to the pupils, 
and then administer all tests simultaneously. The pupils in each 
class would then be divided into four random subgroups of equal 
size, but the procedure would involve no disturbance of the ad- 
ministrative organization of the school. The precautions about 
randomizing the papers are not essential, since the treatments X 
schools interaction is to be used as error, but it will avoid any 
larger-than-chance differences between subgroups that might 
otherwise increase the interaction variance and thus lower the 
precision of the experiment, and it would make possible an evalua- 
tion of the interaction. Other illustrations of “treatments” 
which may be simultaneously administered should readily occur
to the student.
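
The randomized pile of test papers described above can be sketched as follows. The edition labels (type sizes in points) and the function name `make_pile` are hypothetical, and the sketch assumes the class size is an exact multiple of the number of editions.

```python
import random
from collections import Counter


def make_pile(class_size, editions, rng):
    """Build one class's pile: equal numbers of every edition, in random order."""
    if class_size % len(editions) != 0:
        raise ValueError("class size must be a multiple of the number of editions")
    pile = list(editions) * (class_size // len(editions))
    rng.shuffle(pile)      # each class's pile is randomized separately
    return pile


rng = random.Random(2024)
editions = ["8-point", "10-point", "12-point", "14-point"]   # hypothetical type sizes
pile = make_pile(28, editions, rng)

# Papers are handed out in pile order, so pupil i receives pile[i].
counts = Counter(pile)
```

Handing the papers out in pile order divides the class into four random subgroups of equal size while leaving the class itself intact.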

The method of analysis (pp. 119-127) used with “randomized
groups” designs is not appropriate for direct application to
observational data, since with such data the subgroups would not
ordinarily be proportional in size from group to group. For
example, if we wished to evaluate the difference in mean achievement
of boys and girls in a number of schools, we would almost
certainly not find the ratio of boys to girls to be the same for all
schools. As a result, the difference in general means for the sexes
(and also the differences in school means) would in part depend
upon the way in which the sexes were distributed in the various
schools. For instance, suppose that the scores on an achievement
test administered in three schools gave the following totals and
means for boys and girls separately:


School x School 2 School 3 Total Sample 
» T M a п т M 


Воуѕ 20 380 тоо то 


Io 2 25. o Зоо 20.0 
Girls IS 273 18.2 DS ae d 


20 490 24.5 5° 1000 20.0 


Tro о.о 


Difference + o.8 


This, of course, is due to the 
general level of achievement is 
ains the highest proportion of 
хез is obviously not measured 


analysis of Section 6 (pages 119-127) would therefore not enable 


Because of the disproportionate 
mpossible to compute the variance 


(sex X schools) variance by the remainder theorem (page 107). 


* has presented empirical evidence indicating 


numbers are not too disproportionate, use- 


fully accurate results may be secured by applying the ordinary 


methods of analysis of variance (Section 6, РР. 119 ff.) to adjusted 


mbers for Tables of Multiple 


," Journal of American Sta- 
pp. 389-393, December, 1934.
or “expected” numbers and sums instead of to the actual numbers
and sums for the subgroups. Since this procedure appears to 
have considerable value in educational research, it will be de- 
scribed in detail in relation to a concrete example. 

We shall illustrate this method as applied in an analysis of sex 
differences in performance on a high-school language test adminis- 
tered in a number of schools. The following table summarizes 
the results obtained from the administration of the 1939 Iowa
Every-Pupil Test in English Correctness in seven high schools
randomly selected from all schools below 100 in enrollment that
participated in the 1939 Iowa Every-Pupil Testing Program.


                  Boys                      Girls                     Schools
            n      T       M          n      T       M          n      T       M

School 1   12    2097  174.7500      19    3228  169.8947      31    5325  171.7742
School 2    9    1145  127.2222       9    1455  161.6667      18    2600  144.4444
School 3   10    1705  170.5000      14    2629  187.7857      24    4334  180.5833
School 4    5     839  167.8000      13    2348  180.6154      18    3187  177.0556
School 5    8    1174  146.7500      19    3302  173.7895      27    4476  165.7778
School 6   11    2036  185.0909      16    2745  171.5625      27    4781  177.0741
School 7    6     841  140.1667       7    1173  167.5714      13    2014  154.9231
Total      61    9837  161.2623      97   16880  174.0206     158   26717  169.0949


We note, as might be expected, that the ratio of boys to girls dif- 
fers considerably from school to school, and that hence the methods 
of Section 6 (pp. 119 ff.) may not be applied directly to these data.

The first step in the procedure is to adjust the “class” numbers
(each “class” consisting of pupils of the same sex in the same
school) so as to make them proportional. These “expected” numbers
are computed in exactly the same way as we computed the
“theoretical” frequencies in a contingency table in applying a test
of independence or homogeneity (pp. 43 ff.). For example, the
“expected” number of boys in School 1 is (61 × 31)/158 =
11.9684. The expected numbers thus computed are given in the
table below, together with the actual numbers.
The sum of expected numbers of course exactly equals the sum
of actual numbers for any row or column.

                 Boys                  Girls
            Actual  Expected      Actual  Expected

School 1      12    11.9684         19    19.0316
School 2       9     6.9494          9    11.0506
School 3      10     9.2658         14    14.7342
School 4       5     6.9494         13    11.0506
School 5       8    10.4240         19    16.5760
School 6      11    10.4240         16    16.5760
School 7       6     5.0190          7     7.9810
Total         61    61.0000         97    97.0000

The second step is to apply the χ² test to the table of actual
and expected frequencies. This is essential, since if the
discrepancy between actual and expected numbers is larger than
could be attributed to random selection, the procedure later
described is not likely to give accurate results. In this case,¹ the
value of χ² is

χ² = (12 − 11.9684)²/11.9684 + ... + (7 − 7.9810)²/7.9810 = 3.254.

Since this χ² has 6 d.f., its value is no larger than could result by
chance selection (.8 > P > .7). Had χ² proved highly significant,
we would not be justified in continuing with this procedure.
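
The first two steps can be checked with a short computation. The sketch below uses the actual numbers of boys and girls in the seven schools; the function name `expected_numbers` and the variable names are chosen for this illustration only.

```python
boys = [12, 9, 10, 5, 8, 11, 6]
girls = [19, 9, 14, 13, 19, 16, 7]


def expected_numbers(table):
    """Expected cell numbers: (row total x column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# One row per school, columns = (boys, girls).
table = [list(pair) for pair in zip(boys, girls)]
expected = expected_numbers(table)

chi2 = sum(
    (o - e) ** 2 / e
    for actual_row, expected_row in zip(table, expected)
    for o, e in zip(actual_row, expected_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(expected[0][0], 4), round(chi2, 3), df)
```

The first expected number reproduces (61 × 31)/158, and the χ² and its degrees of freedom agree with the values used in the text.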

The third step, granting that the χ² test is not significant, is to
compute an expected sum for each class by multiplying the actual
class mean by the expected number. For example, the expected sum
[...] × 11.9684 = 2094.623. A new [table of expected]
numbers and sums and actual class means is then prepared, as below.

                     Boys                           Girls                         Schools
          Expected  Expected   Actual     Expected  Expected   Actual            Expected
School     number     sum       mean       number     sum       mean       n       sum       Mean

   1      11.9684   2094.623  174.7500    19.0316   3233.368  169.8947    31    5327.991  171.8707
   2       6.9494    884.118  127.2222    11.0506   1786.514  161.6667    18    2670.632  148.3684
   3       9.2658   1579.819  170.5000    14.7342   2766.872  187.7857    24    4346.691  181.1121
   4       6.9494   1166.109  167.8000    11.0506   1995.909  180.6154    18    3162.018  175.6677
   5      10.4240   1529.722  146.7500    16.5760   2880.735  173.7895    27    4410.457  163.3503
   6      10.4240   1929.388  185.0909    16.5760   2843.820  171.5625    27    4773.208  176.7855
   7       5.0190    703.497  140.1667     7.9810   1337.387  167.5714    13    2040.884  156.9911
 Total    61.0000   9887.276  162.0865    97.0000  16844.605  173.6557   158   26731.881  169.1891

¹ In a 2 × n table, χ² is most easily computed by the method described on page 44.

The marginal totals in this table were secured by adding the
expected sums within the body of the table. The marginal means
were secured from the marginal totals.

The fourth step is to compute the variances for sex, schools, and
sex X schools from this table of adjusted data. Since class and
marginal means are available, it will be more convenient ¹ to
compute the sums of squares for classes, sex, and schools by means of
formula (18a), page 92, than of (18). In other respects the
computational procedure is the same as that described on pages 120-122.

The fifth step is to compute the sum of squares for within
classes from the table of original data, by subtracting the sum of
squares for classes from that for total. In the example, the sum of
squares for total (from unadjusted data) was 126146.58, and that
for classes (also from unadjusted data) was 35680.16, leaving
90466.42 as the sum of squares within classes.

The results for the example are summarized in the table below.


                           Sum of
                  d.f.     Squares     Variance

Sex                 1      3009.71     3009.71  }
Schools             6     16606.33     2767.72  }  From adjusted data
Sex X Schools       6     12258.27     2043.05  }
Within Classes    144     90466.42      628.24     From original data
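
The variances and variance ratios for this example follow directly from the sums of squares and their degrees of freedom (1, 6, 6, and 144). The sketch below simply repeats that arithmetic; the dictionary layout and names are chosen for this illustration.

```python
# (sum of squares, degrees of freedom) for each source of variation.
sources = {
    "sex": (3009.71, 1),
    "schools": (16606.33, 6),
    "sex_x_schools": (12258.27, 6),
    "within_classes": (90466.42, 144),
}

# Variance (mean square) = sum of squares / degrees of freedom.
variance = {name: ss / df for name, (ss, df) in sources.items()}

f_interaction = variance["sex_x_schools"] / variance["within_classes"]   # 6 and 144 d.f.
f_sex_within = variance["sex"] / variance["within_classes"]              # 1 and 144 d.f.
f_sex_interaction = variance["sex"] / variance["sex_x_schools"]          # 1 and 6 d.f.
f_schools_within = variance["schools"] / variance["within_classes"]      # 6 and 144 d.f.

for label, f in [
    ("sex X schools / within", f_interaction),
    ("sex / within", f_sex_within),
    ("sex / sex X schools", f_sex_interaction),
    ("schools / within", f_schools_within),
]:
    print(f"{label}: F = {f:.2f}")
```

Which ratio is appropriate depends on the generalization intended: the within-classes variance serves for statements about these seven schools only, and the sex X schools variance for statements about the population of schools.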


The results may now be interpreted as in a randomized groups 
experiment. We note first that the variance ratio for sex X 
schools and within classes is F = 2043.05/628.24 = 3.25. This 
ratio, for 6 and 144 df., is significant well beyond the 1 per cent 
level. This indicates that the true sex difference varies from school 
to school, hence it is not appropriate to test sex against within 
classes if we wish to generalize about all schools. We note, how- 
ever, that for this particular set of seven schools, girls seem superior 

1 For example, the sum of squares for schools is

(5327.991 × 171.8707 + ⋯ + 2040.884 × 156.9911) − 26731.881 × 169.1891
= 16606.33

to boys in the trait measured, since F = 3009.71/628.24 = 4.79
(1 and 144 d.f.) is significant beyond the 5 per cent level. If we
wish to generalize about all schools, we may, on the assumption
that these seven schools are a random sample from the population
of schools, test sex against sex X schools. This variance ratio,
F = 3009.71/2043.05 = 1.47 (1 and 6 d.f.), is not significant.
Hence these results are consistent with the interpretation that in
the population of schools boys and girls are equal in mean achievement,
but that in some schools there are real sex differences, and
that we just happened in this small sample to get a majority of


This interpretation may 
nds, but there is nothing 
€ us to reject it with any 
that there are real differences 
= 2767.72/628.24 = 


es is homogeneous. For example, 
of scores on an attitudes scale for 
ssified with reference to both reli- 
party affiliations. We would first 


roughly the same for the different 
test this assumption by the χ² test.


We would then prepare a table of adjusted numbers and totals for


the subgroups, and would i 
preceding illustration, religious affiliations takin 
and political affiliations that of schools. 


nevertheless interpret the
levels on pages 148-150.
If the investigation is such as to render the latter type of interpretation
inappropriate, the type of interpretation suggested for
factorial designs (pages 163-173) might be applied.

Before concluding the discussion of analysis of two-way classi- 
fications with disproportionate subgroup numbers, it may be 
well to emphasize that the method of "expected" numbers is an 
approximate method only, and may safely be employed only when 
it seems reasonable to suppose both on a priori grounds and in 
consideration of the results of a χ² test, that in the whole population
the subgroup numbers are proportional, or nearly so. If
this assumption seems unreasonable, or if the χ² test proves significant,
other methods * of analysis are available, but these are
beyond the scope of this book. 

We noted in Chapter I (pages 5 and 6), that the use of “con- 
trolled” and "stratified" samples appears to offer very important 
(but hitherto largely neglected) possibilities in educational and 
psychological research. If we are to use such samples, it is of 
course important that we have unbiased objective measures of 
the precision or reliability of the results obtained from them, and, 
particularly, that we be able to secure an unbiased estimate of 
the standard error of the mean of a sample of this type. One 
solution to this problem is directly suggested by the methods of 
analysis of variance. 

Suppose, for instance, that we are studying some trait in which 
it is known or suspected that there are systematic sex differences 
in the population to be sampled. Assuming that these differences 
concern only the means of the sexes, and not their variabilities, 
we might draw our sample so that it contains an equal number of
each sex, rather than allow chance to determine the proportion of 
sexes in our sample. In other words, our sample would consist of 
two random samples of equal size, one from each sex, and the total 
sample would be a controlled sample rather than a simple random 
sample. We could now compute the standard error of the weighted 


* See Yates, F., “The Analysis of Multiple Classifications With Unequal Numbers
in the Different Classes," Journal of American Statistical Association, Vol. 29, pp. 51-66,
March, 1934.
mean of the total sample by analyzing the total variance into
between sexes and within sexes, divide the latter variance by the
total number of cases, and extract the square root of the result.
The d.f. for this standard error would be the same as for the within
sexes variance, or N − 2. This standard error would indicate the
variability in the sampling distribution of the means of a large 
number of similar samples, all of which had been “controlled” in 
exactly the same fashion. It is important to note, however, that 
this procedure assumes homogeneous variance within sexes. 

It may be well to indicate in more general terms the procedure
that may be followed in evaluating the mean of a controlled or
representative sample. Suppose that for a given population we
are interested in a variable x, which is known to be related to a
variable y. Suppose that the distribution of y in the whole population
is already known, and that the whole population may therefore
be divided into definite classes or categories with reference to y.
If the relative frequencies in these categories for the population
are not exactly known, we may assume certain relative frequencies
on the basis of available information and subjective opinion. We
will then select our sample so that the relative frequencies in the
y categories conform to the known or assumed relative frequencies
in the population. If we may assume that we have a random
sample from each category, and if we may also assume homogeneous
variance in x within these categories, we may compute
the standard error of the observed x mean for the total sample.
The procedure, as already suggested, would be to analyze the
total variance of the x distribution into between categories and
within categories. The square root of the quotient obtained by
dividing the latter variance by the total number of cases would
be the standard error of the observed x mean. If the relative
frequencies in the categories have been assumed, rather than known,
this procedure of course introduces the danger of bias. In the
latter case, however, the standard error is still valid with reference
to a hypothetical population with the same distribution of y as
we have in our sample. Where the true y distribution is not
known, therefore, any generalizations based on the sample should
be confined to this hypothetical population. Again, it should be
emphasized that this procedure assumes homogeneous variance
within categories, an assumption which should always be carefully
examined in light of the observed data.

To illustrate the procedure just described, suppose we wish to 
know the mean annual expenditure per pupil in rural one-room 
schools in the state of Iowa. Suppose we know or suspect that 
there are systematic differences from one part of the state to an- 
other. We could then divide the state into geographical districts, 
and select a random sample of schools from each district. The 
number selected from each district would be proportional to the 
known total number of schools in that district. “Districts” 
would then constitute our y categories. We would then determine 
the expenditure per pupil in each school in the total sample, and 
analyze the total variance of the distribution of these data into 
between districts and within districts variances, in the manner of 
Section 2 of this chapter (see particularly the Note on page 102).
We would then divide the latter variance by the total number of 
schools in our sample, and extract the square root of the result, 
to determine the standard error of the general mean. The d.f. 
for this standard error would be the number of schools minus 
the number of districts (the d.f. of the within districts variance). 
This standard error would validly describe the reliability of 
our mean if the variance in per-pupil-expenditures were funda- 
mentally constant from district to district. If the differences in 
mean per-pupil-expenditure from district to district were marked, 
the mean of this type of sample might be considerably more reli- 
able than the mean of a simple random sample selected from the 
state at large. 

By way of further illustration, and also to exemplify the compu- 
tational procedure, suppose we have administered a test of informa- 
tion about contemporary affairs to a sample of students in a certain 
university, this sample having been made representative with 
respect to the distribution by departments and colleges, such as
Law, Medicine, Engineering, etc. Specifically, the sample has
been taken so that the number selected from each department is 
proportional to the total number in that department in the whole 
university. Suppose the results are as summarized below (the 
last column containing the sums of squares of individual scores). 


Department
or College      n       T        M          ΣX²

#1             27     2009    74.4074     179925
#2             13      709    54.5385      47411
#3             18      810    45.0000      49840
#4             22     1757    79.8636     157987
#5             39     2877    73.7692     236103
Total         119     8162    68.5882     671266


The sum of squares for total is

671266 − 8162 × 68.5882 = 111449.11,

and for departments is

(2009 × 74.4074 + ⋯ + 2877 × 73.7692) − 8162 × 68.5882 = 17339.71,

leaving 94109.40 as the sum of squares within departments. The
variance for within departments is then 94109.40/114 = 825.521.
Accordingly, the estimated standard error of the general mean is
√(825.521/119) = 2.633. Hence, since for 114 d.f. the value of
t at the 1 per cent level is about 2.58, we may be highly confident
that the sample mean does not differ from the population mean
by more than 2.58 × 2.633 = 6.79, or that the population mean
lies between 75.38 and 61.80. It is worth noting that had we
considered the total sample as a simple random sample, we would
have estimated the standard error of the mean as √(111449.11/119²)
= 2.805. Hence, our control resulted in an appreciable increase


of this random sample would tend to be less than 2. 


sampling is therefore not as great as seems indicated by the comparison of 2.58
with 2.805.
in precision. The fact that the control was worth while is also
demonstrated by the fact that the variance for departments,
17339.71/4 = 4334.93, is significantly larger (F = 4334.93/825.52
= 5.25) than that within departments.
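
The computation for this representative sample can be retraced in Python. The sketch below works from the department totals directly rather than from the rounded means, so its results may differ from the figures in the text in the last digit or two; the variable names are illustrative only.

```python
import math

# (n, sum of scores, sum of squared scores) for each department, from the table.
departments = {
    "#1": (27, 2009, 179925),
    "#2": (13, 709, 47411),
    "#3": (18, 810, 49840),
    "#4": (22, 1757, 157987),
    "#5": (39, 2877, 236103),
}

n_total = sum(n for n, _, _ in departments.values())
sum_total = sum(t for _, t, _ in departments.values())
sumsq_total = sum(q for _, _, q in departments.values())
correction = sum_total ** 2 / n_total          # N times the squared general mean

# Analysis of the total sum of squares into between and within departments.
ss_total = sumsq_total - correction
ss_between = sum(t ** 2 / n for n, t, _ in departments.values()) - correction
ss_within = ss_total - ss_between

df_within = n_total - len(departments)
var_within = ss_within / df_within

# Standard error of the general mean of the controlled sample.
se_controlled = math.sqrt(var_within / n_total)

# Comparison figure used in the text for a simple random sample.
se_simple = math.sqrt(ss_total / n_total ** 2)

# Variance ratio for departments against within departments.
var_between = ss_between / (len(departments) - 1)
f_ratio = var_between / var_within

print(df_within, round(var_within, 2))
print(round(se_controlled, 3), round(se_simple, 3), round(f_ratio, 2))
```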

It is important to note that the procedure just described as- 
sumes homogeneous variance within groups (see test on page 99, 
footnote). If this assumption cannot be made, the preferred 
procedure is that described in the following paragraphs. 

Suppose we have a sample whose members have been classified 
into certain categories with reference to which it is possible to 
classify all members in the population. Suppose also that the 
members in each category may be considered a random sample 
from all such members in the population, but that the numbers in 
the categories of the sample are not proportional to those in the
population. This would be a “stratified,” but not a “representa- 
tive" sample. Suppose, further, that we know the numbers in 
these categories in the whole population, or know that they are 
in a certain proportion. We may then secure from this stratified 
sample an unbiased estimate both of the population mean and of 
the standard error of this estimate. How this may be done may 
be illustrated with the data used in the preceding example. 

Suppose that this sample had been drawn by taking any convenient
number ($n_p$) of students at random from each department
separately, but with no attempt to make these numbers proportional
to those in the population. Suppose, however, that we know
that in the population (that is, in the whole university) 35 per cent
of the students are enrolled in department #1, 15 per cent in #2,
8 per cent in #3, 20 per cent in #4, and 22 per cent in #5. The
formula for estimating the population mean is then

$$\text{est'd } M_{pop} = \frac{n'_1 M_1 + n'_2 M_2 + \cdots + n'_r M_r}{n'_1 + n'_2 + \cdots + n'_r} = \frac{\sum n'_p M_p}{\sum n'_p}$$

in which $r$ is the total number of categories, $M_p$ is the observed
mean of category $p$ of our total sample, and $n'_1, n'_2, \ldots, n'_r$ are
numbers which are proportional to the numbers in these categories
in the whole population. (Similarly, $n_1$, $n_2$, etc. denote the actual
numbers in the sample.) In our example, then, the estimate of
the population mean is

$$\text{est'd } M_{pop} = \frac{74.4074 \times 35 + 54.5385 \times 15 + \cdots + 73.7692 \times 22}{35 + 15 + \cdots + 22} = 70.0253.$$

The standard error of this mean may then be secured by first
estimating the variance of the observed mean of each category,
using formula (8) on page т (but noting that the n in the formula
refers to the actual number in the category of the sample). For
our example, since Σd² for department #1 is 179925 − 74.4074 ×
2009 = 30440.54, the estimated variance of the observed mean of
department #1 is s² = 30440.54/(27 × 26) = 43.3626. The estimated
variances of the means for the other categories are 56.0462,
43.7582, 38.2395, and 16.1059, respectively. Each of these variances
is then multiplied by the square of the weight used with the
mean, and the sum of the products is divided by the square of the
sum of the weights. The result is the estimated variance of the
weighted general mean. If $s^2_p$ represents the estimated variance
of the mean of category $p$ and if $n'_p$ has the same meaning as before,
then

$$\text{est'd } \sigma^2_M = \frac{\sum n'^2_p \, s^2_p}{\left(\sum n'_p\right)^2}.$$

For our example,

$$\text{est'd } \sigma^2_M = \frac{35^2 \times 43.3626 + 15^2 \times 56.0462 + \cdots + 22^2 \times 16.1059}{100^2} = 9.1621$$
and hence the estimated standard error of the mean is the square
root of this result, or 3.027. This mean is of course less stable
than in the first example given, since we have given relatively
little weight to the most stable group mean (that for department
#5), and relatively heavy weight to some of the less stable. This
last estimate (3.027) is an unbiased estimate under the conditions
given, but a better estimate of the population mean (i.e., a smaller
standard error of the mean) would have resulted from the same
size (119) sample if the category numbers in the sample had been
proportional to those in the population. The possibility of securing
an unbiased estimate of the standard error of the mean of a
stratified but not representative sample therefore does not reduce
the desirability of making the sample representative whenever
possible. We may note again that the procedure last described
is valid even if the variances within groups are heterogeneous.
Hence, even though the sample is representative, this latter procedure
may sometimes be the appropriate one to use.
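
The weighted mean and its standard error for the stratified sample can be sketched in Python as follows. The data and weights are those of the example; the function and variable names are illustrative assumptions, and the category means are computed from the totals rather than taken in rounded form.

```python
import math

# (n in sample, sum of scores, sum of squared scores, population weight n')
strata = [
    (27, 2009, 179925, 35),   # department #1
    (13, 709, 47411, 15),     # department #2
    (18, 810, 49840, 8),      # department #3
    (22, 1757, 157987, 20),   # department #4
    (39, 2877, 236103, 22),   # department #5
]

weight_sum = sum(w for _, _, _, w in strata)

# Estimated population mean: category means weighted by population proportions.
weighted_mean = sum(w * t / n for n, t, _, w in strata) / weight_sum


def variance_of_mean(n, total, sumsq):
    """Estimated variance of a category mean: sum of squared deviations / (n (n - 1))."""
    ss_deviations = sumsq - total ** 2 / n
    return ss_deviations / (n * (n - 1))


# Estimated variance of the weighted mean: squared weights times the
# variances of the category means, divided by the squared sum of weights.
var_weighted_mean = (
    sum(w ** 2 * variance_of_mean(n, t, q) for n, t, q, w in strata)
    / weight_sum ** 2
)
se_weighted_mean = math.sqrt(var_weighted_mean)

print(round(weighted_mean, 4), round(var_weighted_mean, 4), round(se_weighted_mean, 3))
```

No assumption of equal variance within categories is used here: each category contributes its own estimated variance.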

In concluding this discussion of general possibilities of analysis 
of variance in educational research, it may be well to repeat that 
the foregoing has been intended to be suggestive only. There 
are many other variations of these methods which have not been 
presented here. (One of these, the factorial design, will be given 
special consideration in the next two sections, and an extension of 
these methods, known as analysis of covariance, will be presented 
in Chapter VI. Other variations, such as the Latin square design 
and the Graeco-Latin square design, may be found described in 
Fisher’s Design of Experiments.) Thus far, our experiences with 
these methods in the field of education have been extremely limited 
— so much so, in fact, that the writer has been forced to use many 
hypothetical illustrations. As our experience accumulates, it is 
probable that many of the suggestions here made will be in need 
of revision, or almost certainly of a redistribution of emphasis, and 
the student is therefore urged to retain a highly critical attitude
in relation to them.


II. SIMPLE FACTORIAL DESIGNS IN METHODS EXPERIMENTS 


It may sometimes be desirable to design a methods experiment 
so as to permit an evaluation, not only of certain methods, but 
also of certain variations in procedure which may be tried with all 
of these methods. For example, in addition to evaluating several 
ways of distributing drill in arithmetic, we may wish to evaluate
several types of drill materials (each of which may be used in conjunction
with any of the distributions of drill) and may wish also
to determine which particular combination of type and distribution
of drill is most effective. If we followed the procedure which
has heretofore been customary in educational research, we would
experiment separately with the distributions and the types of drill, 
each experiment being of the familiar "single variable" type. 
However, we can often serve the same ends much more efficiently 
by employing analysis of variance with what is known as a factorial 
design — a factorial design being one that permits comparisons 
of two or more factors in all combinations. 

For the purposes of concrete illustration of the analytical pro- 
cedure to be followed in factorial designs, let us suppose that we 
have planned an experiment to determine the relative effects 
upon reading rate of three different sizes of type and four different 
styles of type (such as Old English, Gothic, etc.). We thus have 
twelve possible combinations of style and size of type. Suppose 
we have accordingly prepared twelve different editions of the 
same rate-of-reading test, one in each of these combinations. 
Suppose our experiment has involved 120 pupils (all in the same 
school) and that we assigned these pupils at random to twelve
equal groups, ten pupils per group. All groups were tested simul- 
taneously and under the same conditions, but each took a different 
edition of the reading test. Suppose the results are as given in 
the table at the top of the opposite page, the measures being the 
number of words read per unit of time. In the following tables 
the Roman numerals represent the styles of type and the capital 
letters the sizes. Thus, in the group which was presented with 


ad 225 words per unit of 


will have no effect on the final 
The totals for the twelve groups 


* The student is strongly urged to check these results as an exercise.

      Style I          Style II         Style III        Style IV
    A    B    C      A    B    C      A    B    C      A    B    C

  ...  169  162    191  225  228    175  215  231    222  204  219
  ...  169  156    185  213  198    167  197  219    202  201  217
  ...  168  142    177  211  175    154  185  191    174  195  207
  ...  162  137    176  208  167    153  184  188    171  174  196
  ...  151  137    157  165  152    151  176  172    170  172  187
  ...  151  131    146  163  148    138  168  165    164  168  181
  ...  141  127    132  163  144    133  153  162    157  165  174
  ...  132  123    132  153  129    132  150  159    147  162  168
  ...  119  108    131  142  125    121  146  157    145  160  152
  ...  115  105    127  141  114    115  120  150    116  152  150


and the general totals and means for sizes and styles are given
below (in terms of reduced scores).

                            Style
Size               I       II      III       IV    Totals

A                348      554      439      668      2009
B                477      784      694      753      2708
C                328      580      794      851      2553
Style Totals    1153     1918     1927     2272
Style Means    38.43    63.93    64.23    75.73


                  Variances

Size                3369.0
Style               7446.9
Size X Style        1020.3
Within Groups        668.5


The variance ratios in which we will be particularly interested
are as follows (the numbers in parentheses after each F being the
values needed for significance at the 5 per cent and 1 per cent
levels, respectively, for the given d.f.):

Size/Within Groups:          F =  5.04  (3.09, 4.82)
Style/Within Groups:         F = 11.14  (2.70, 3.98)
Size X Style/Within Groups:  F =  1.53  (2.19, 2.99)
Size/Size X Style:           F =  3.30  (5.14, 10.92)
Style/Size X Style:          F =  7.30  (4.76, 9.78)
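
The variances for size, style, and size X style can be recovered from the twelve group totals alone, which gives a convenient check on the ratios just listed. The following Python sketch assumes ten pupils per group, as in the example, and takes the within groups variance (668.5) from the text rather than recomputing it from individual scores; all names are illustrative.

```python
# Group totals in reduced scores; rows are sizes A, B, C, columns are styles I-IV.
cell_totals = {
    "A": [348, 554, 439, 668],
    "B": [477, 784, 694, 753],
    "C": [328, 580, 794, 851],
}
n_per_cell = 10
var_within = 668.5            # within groups variance quoted in the text

n_sizes = len(cell_totals)
n_styles = len(cell_totals["A"])
size_totals = [sum(row) for row in cell_totals.values()]
style_totals = [sum(col) for col in zip(*cell_totals.values())]
grand_total = sum(size_totals)
n_total = n_per_cell * n_sizes * n_styles
correction = grand_total ** 2 / n_total

# Sums of squares for the main effects and for the twelve groups as a whole.
ss_size = sum(t ** 2 for t in size_totals) / (n_per_cell * n_styles) - correction
ss_style = sum(t ** 2 for t in style_totals) / (n_per_cell * n_sizes) - correction
ss_groups = (
    sum(t ** 2 for row in cell_totals.values() for t in row) / n_per_cell
    - correction
)
ss_interaction = ss_groups - ss_size - ss_style

var_size = ss_size / (n_sizes - 1)
var_style = ss_style / (n_styles - 1)
var_interaction = ss_interaction / ((n_sizes - 1) * (n_styles - 1))

f_ratios = {
    "size/within": var_size / var_within,
    "style/within": var_style / var_within,
    "interaction/within": var_interaction / var_within,
    "size/interaction": var_size / var_interaction,
    "style/interaction": var_style / var_interaction,
}
for name, value in f_ratios.items():
    print(name, round(value, 2))
```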

We note first that the variances for both style and size are highly 
significant when tested against within groups. This tells us at 
once that the differences in general size means, or in general style 
means, cannot reasonably be attributed entirely to fluctuations in 
random sampling within the experiment. Had neither of the 
main effects proved significant when tested against within groups, 
our analysis would of course have been concluded.* What kind of
general conclusions we may draw about the influence of type size 
(or style), however, depends upon whether or not there is any 


the presence of an interaction if it does exist. The second is that 


in advance of the experi- 
dependent, and in which the major 


experimental material, 


The example we have taken is perhaps a better illustration of
the latter than of the former type of situation. There does not
appear to have been any strong reason to suspect, in advance of
this experiment, that if one style of type proves best in one size
it will not also prove best in another. We might therefore be disposed
in this experiment to interpret the results on the assumption

* It is possible, although unlikely, that the interact 
though the main effects are not 


between groups (main effects 
action as significant. If bo!
that there is no real interaction, unless, of course, the experimental
data present convincing evidence to the contrary. That they do
not is evident when we test size X style against within groups.
The interaction variance is appreciably larger than the within
groups variance, but not by an amount greater than could reasonably
be attributed to chance fluctuations in random selection of the
experimental groups. We may feel justified, therefore, in retaining
the hypothesis that there is no real interaction, and proceed
with our interpretation on that basis.

If we assume no interaction we may, if we wish, proceed to test 
differences in individual size (or style) means by the t-test, basing
our error estimate on the within groups variance. For instance,
our estimate of the standard error of the difference in any two
size means would, on our assumption, be 1.414 × √(668.5/40) = 5.77.
Hence, for 108 d.f. a difference in size means would have to exceed 
1.96 X 5.77 = 11.31 to be significant at the 5 per cent level. We 
thus see that the difference in size means for A and B and for A and 
C are significant at the 5 per cent level, but that the difference for 
B and C could easily be due to chance. Similarly, a difference in 
style means would have to exceed 13.07 to be significant at the 
5 per cent level. Thus, on the assumption of no interaction, we 
could safely recommend B and C over A, and could feel sure that 
style I is inferior to the others for the population sampled. 
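
The comparisons of individual size and style means can be sketched in Python as below. The sketch uses the within groups variance and the value 1.96 from the text, and writes the factor 1.414 as the square root of 2 inside the radical; because it carries more decimal places than the text, its critical differences agree with 11.31 and 13.07 only to within rounding. The function names are illustrative.

```python
import math
from itertools import combinations

VAR_WITHIN = 668.5      # within groups variance from the text (108 d.f.)
T_5_PERCENT = 1.96      # value of t used in the text at the 5 per cent level

# Means in reduced scores, from the size totals (40 pupils each)
# and the style totals (30 pupils each).
size_means = {"A": 2009 / 40, "B": 2708 / 40, "C": 2553 / 40}
style_means = {"I": 1153 / 30, "II": 1918 / 30, "III": 1927 / 30, "IV": 2272 / 30}


def critical_difference(n_per_mean):
    """Smallest difference between two means significant at the 5 per cent level."""
    se_difference = math.sqrt(2 * VAR_WITHIN / n_per_mean)
    return T_5_PERCENT * se_difference


def significant_pairs(means, n_per_mean):
    """Pairs of categories whose means differ by more than the critical difference."""
    limit = critical_difference(n_per_mean)
    return [
        (a, b)
        for a, b in combinations(means, 2)
        if abs(means[a] - means[b]) > limit
    ]


size_pairs = significant_pairs(size_means, 40)
style_pairs = significant_pairs(style_means, 30)
print(size_pairs)
print(style_pairs)
```

Like the conclusions in the text, this rests on the assumption of no interaction.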

These recommendations, let us not forget, would be based upon 
the assumption of no interaction. While it may seem reasonable 
in this particular example, in general we would hesitate to rest so 
heavily upon this assumption. This would be particularly true 
if the interaction variance, although not significant, turned out to 
be so much larger than the within groups variance as it did in this 
example. In general, then, we would feel it necessary, in situations 
like this, to consider the results further in light of the possibility 
that there is, after all, a real interaction. Before doing so in this
case, however, let us pause to consider the advantages of the factorial
design in situations in which the assumption of no interaction
appears reasonable on both a priori and experimental grounds.

The advantage, in this example, is that we have used the same
experimental material for two independent purposes — to de- 
termine whether size of type influences reading rate, and to deter- 
mine if style of type influences reading rate — and have accom- 
plished each purpose as effectively as if it had been the sole purpose 
of the experiment. In other words, our comparisons of size means 
have been just as precise as if style had been held constant for all 
pupils, and our comparisons of style means as precise as if only one 
size had been used. Our experiment has therefore yielded essen- 
tially the same results as would two independent experiments of 
the single variable type, each using 120 pupils. We have thus 
secured twice as much information per pupil as if we had conducted 
two independent experiments. Furthermore, we have demon- 
strated that the assumption of no interaction is tenable, which 
could not have been learned at all from independent experiments. 
Finally, we are in a position to compare any size-style combination 
with any other (by applying the t-test to differences in individual 
group means) which again would have been impossible in two sin- 
gle-variable experiments. All of this, let us remind ourselves, is 
on the assumption of no interaction. 

Let us now reconsider the results for the example in light of the 
possibility that there is a real interaction — which in general 
would be the safer point of view when the observed interaction
variance is appreciably larger than the within groups variance. 
Let us first remind ourselves of what, in general, is meant by say- 
ing that an interaction exists between two classifications. It may 
mean that the rank order of the categories within one classification 
differs from category to category of the other, or it may mean that 
the rank order is the same, but that some differences within the
first classification are larger or smaller in certain categories of the
second classification than in others. With reference to our example,
it may mean that the rank order of the styles differs from
size to size, or it may mean that the rank order of styles is the same
for all sizes, but that the superiority or inferiority of some styles is
more pronounced in certain sizes than in others. The student
should not get the idea that a significant interaction necessarily
means a variation in rank order.

We may now recall that in a situation somewhat similar to this, 
the methods-schools type of experiment, we tested the methods 
differences against the interaction (M x S) variance. In that 
situation, we had no interest in school differences other than to 
eliminate them from the methods and error variances. In other 
words, we introduced “schools” into the analysis only in order to 
make possible a valid test of the methods differences, and to in- 
crease the precision of the test, and not because we were interested 
in the particular schools involved. In that situation, also, the 
particular schools used were considered as a random sample of all 
schools in a specified population of schools. For the purpose of
generalizing about all schools in the population, it was therefore
appropriate to use the methods X schools variance as the error term,
since it measured fluctuations due to the random selection of schools.

We have no very close parallel to this interpretation in our fac- 
torial design. We are now interested in evaluating differences in 
the means of both rows and columns, instead of only in one, and 
the particular styles (or sizes) involved may not strictly be con- 
sidered as a random sample from a “population” of styles (or 
sizes). The interaction variance in a factorial design is therefore 
usually not strictly a measure of normally distributed random 
fluctuations, which theoretically must be true of the error term 
in any F-test or t-test. However, so far as our style comparisons 
are concerned, we may take somewhat the same position as that
suggested on page 150 with reference to the treatment-level type
of experiment. We may be able to say that we have included all
sizes in which we are interested, or as wide a range in sizes as that
in which we are interested. Similarly, while our sample of styles
may not be considered as the equivalent of a random sample of all
styles in which we are interested, it may include all the styles in
which we are interested. That is, we may have no desire to gen- 
eralize about other styles from the results of this experiment. If 
this position seems reasonable (and there will be some factorial 




experiments in which it is not), it will be meaningful to test the 
variances for style and size against that for size X style, even though 
the latter is not strictly a measure of random variations, and though 
we may not interpret the probabilities from the F or t tables so
literally as otherwise. In other words, if the variance for style 
is "significantly" larger than that for size X style, we may be 
reasonably confident that the rank order of the styles does not dif- 
fer greatly from size to size, and we may be running relatively little 
risk in recommending certain styles for all sizes (although recog- 
nizing that there may be a few exceptions to the general rule). 
Similarly, if the variance for size is "significantly " larger than that 
for size X style, we may quite safely recommend one size for all 
styles (or, as in the example, conclude that a certain size is inferior 
to the others for all styles). Furthermore, this procedure would be 
fairly safe even though the interaction variance proved significant, 
although in that case we would perhaps be more cautious, and be- 
fore generalizing to all styles or all sizes, would insist that the main 
effects be more highly "significant" when tested against interac- 
tion. Actually, in this example, only one of the main effects is 
“significant,” even at the 5 per cent level, when tested against 
interaction, but because of the lack of a priori reasons for suspect- 
ing a real interaction, we need hardly modify at all the conclusions 
earlier based on the tests against within groups alone. In other 
situations presenting similar experimental results, it might be 
safer to follow the procedure suggested in the following paragraphs.
Before concluding this paragraph, however, let us emphasize
again that the procedure just recommended is arbitrary
in character, although wide experience in agricultural research in- 
dicates that it is usually satisfactory. 

We have now left to consider the case in which the variances for 
the main effects are not "significantly" larger than the interaction 
variance, and in which the interaction variance is significantly 
larger than the within groups variance. In this case the suggestion 
would be strong that the rank order of the categories of the first 
classification may differ markedly from category to category of
the second, and we could not hope to extend our recommendations
concerning one classification to all categories of the other. We 
might nevertheless be able to reveal significant differences within 
one classification for specific categories of the other. How this 
may be done may be illustrated in the case of our example. 

Let us suppose, for the sake of illustration, that the size X style 
variance had proved significant, and that the size and style vari- 
ances had not been much larger than that for size X style. We 
would then have to restrict our recommendations (if any) concern- 
ing styles to certain sizes, and those for sizes to certain styles. 
That is, we would have to consider the observed style differences 
for each size separately and the observed size differences for each 
style separately. Suppose, then, we raise the question of whether 
there are real differences in style for size A alone. We would then 
compute the sum of squares for style means within size A alone, as 
follows: 


(348² + 554² + 439² + 668²)/10 − 2009²/40 = 5184.5

The variance estimated from style means within size A would then
be 5184.5/3 = 1728.2. The ratio of this variance to the within
groups variance is F = 1728.2/668.5 = 2.57. For 3 and 108 d.f.
this F is not significant, hence all differences in style means within
size A could be attributed to chance, and we may not feel justified *
in testing differences in individual style means within size A by the
t-test. For size C, however, a similar test reveals that there are
significant differences in style means, and we are clearly justified
in applying the t-test to individual differences in style means within
that size. The estimated standard error of a difference between
two means of 10 cases each is 1.414 × √(668.5/10) = 11.58, and
hence for 108 d.f. a difference of 11.58 × 1.96 = 22.6 would be required
for significance at the 5 per cent level. We thus see that
for size C the mean for style I is significantly lower than for

1 This, of course, depends on ... If a particular F, for example,
barely falls short of significance, ... workers would feel justified
in testing individual differences.

all other styles, and that IV is significantly above I and II,
though not III. In a similar fashion it may be shown that the 
variance for size means within style I is not significant, but that 
it is within style III, and that for this style size A is significantly 
below both B and C. We are thus able to make a number of rec- 
ommendations of styles for certain sizes and of sizes for certain 
styles, and have 108 d.f. available ¹ for each test. It will be noted,
then, that we are essentially viewing the whole experiment as con-
sisting of three independent or parallel single-variable experiments
with styles, each employing 40 pupils, or of four independent single-
variable experiments with sizes, each employing 30 pupils, except
that our error term is in each case derived from all 120 pupils. 
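
The test for one classification within a single category of the other can be sketched in Python as follows. This is only an illustrative sketch: the function names and the group totals are hypothetical (they are not the data of the example), and the within groups variance is assumed to have been obtained already from the whole experiment. The F and t values needed for significance must still be read from the tables.

```python
from math import sqrt

def ss_between(totals, n_per_group):
    """Sum of squares for group means, from totals of equal-sized groups."""
    grand_total = sum(totals)
    n_all = n_per_group * len(totals)
    return sum(t * t for t in totals) / n_per_group - grand_total ** 2 / n_all

def simple_effect_f(totals, n_per_group, within_variance):
    """F ratio for one classification within a single category of the other."""
    df = len(totals) - 1
    return (ss_between(totals, n_per_group) / df) / within_variance, df

def critical_difference(within_variance, n_per_group, t_value):
    """Smallest difference between two group means reaching the chosen level."""
    # 1.414 * sqrt(v / n) is the same quantity as sqrt(2 * v / n).
    return t_value * sqrt(2 * within_variance / n_per_group)

# Hypothetical totals for four styles within one size, 10 pupils per group.
totals = [300, 340, 360, 400]
f_ratio, df = simple_effect_f(totals, n_per_group=10, within_variance=50.0)
needed = critical_difference(within_variance=50.0, n_per_group=10, t_value=1.96)
print(f_ratio, df, needed)
```

The error term passed in as `within_variance` is the one derived from all pupils, which is why the test keeps the full within groups d.f. even though only one size is examined.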

The preceding illustration should offer a reasonably close parallel 
to most of the applications likely to be made of simple factorial 
designs in educational research.² When applied in instructional
experiments, one of the factors will usually be “methods,” the
other will be some variable such as time spent in study, or distribu-
tion of class time (double vs. single period), or type of motivation,
which may vary in degree within each method. While the particu- 
lar methods involved may not be considered as a random sample 
of any “population” of methods, it may frequently be true that 
they do include all of the methods in which we are interested at the 
moment. Hence methods would be analogous to styles in our illus- 
tration. While the categories or levels based on the second factor 
may also not be considered a random sample from a population of 
such categories, we may nevertheless often be able to include as 
wide a range in these categories or levels as is of any practical in- 
terest, with several well-distributed intermediate levels, and hence 
may follow the type of interpretation of the interaction term which 
has been here suggested in evaluating sizes. 

It is highly important to note, finally, that in any such experi-
ment, performed in a single school, any conclusions drawn would
strictly apply only to the one school involved. Different results
might be obtained in other schools, just as was pointed out on page
103 with reference to the simple methods experiment.

1 On the assumption of homogeneous variance within groups.

2 The student is urged, as a very valuable exercise, to invent or discover for him-
self as many specific illustrations as possible of further applications of this design in
educational research.


12. FACTORIAL DESIGN DUPLICATED IN RANDOMLY SELECTED
SCHOOLS

The simple factorial design considered in the preceding section 
may find many useful applications, but its general usefulness in 
educational research is restricted by the facts that ordinarily one 
would not wish to confine his generalizations to a single school, 
and that a single school would seldom provide sufficient numbers 
for high precision even though the first restriction were acceptable. 
The preceding section, then, was presented in part in order to lead 
the student more gradually to an understanding of the procedure 
— now to be considered — which is appropriate when several 
schools are involved.

Let us suppose, for the sake of illustration, that we wish to 
evaluate three methods of instruction, and two motivating devices 
that may be used in conjunction with any of these methods. We 
thus have six possible combinations of method and device. Sup- 
pose that our experiment is to involve seven schools. Suppose 
that within each school we have been able to assign the pupils at 
random to six equal groups, one for each combination of method 
and device. These groups would of course differ in size from school 
to school. We will assume that a total of 560 pupils is involved 
in all schools. The experiment is conducted under the same con- 
trolled conditions in all schools. At the close of the experiment, all 
groups in all schools take the same criterion test. Within each 
school, then, we have an experiment of the type described in the 
preceding section. Our problem is to analyze and interpret the 
pooled results from these duplicated experiments. 

The basic data needed for the analysis would be the sum of the
criterion scores for each of the 42 groups, and the sum of the
squared scores for the entire sample. From these totals, all other
necessary totals, and all required means and sums of squares, may
be computed. For convenience, we will use a notation in which
numerical subscripts refer to schools, lower case subscripts to
motivating devices, and capital subscripts to methods; T will
stand for total, ss for “sum of squares,” and n for the number of
cases in a group or set of groups. Thus $T_{3aB}$ would be the total for
all pupils in the group using device a and method B in school 3;
$T_{bC}$ would be the total for all groups using device b and method C;
$T_{2b}$, for all pupils using device b in school 2; $T_4$, for all pupils in
school 4, etc.
The sum of squares for methods would then be

$$ss_M = \frac{T_A^2 + T_B^2 + T_C^2}{n_A} - \frac{T^2}{N},$$

T without a subscript being the grand total, N the total number
of cases, and $n_A = n_B = n_C$ the number of pupils under any one
method.

The sum of squares for schools would be

$$ss_S = \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \cdots + \frac{T_7^2}{n_7} - \frac{T^2}{N},$$

and the sum of squares for devices would be

$$ss_D = \frac{T_a^2 + T_b^2}{n_a} - \frac{T^2}{N}.$$

In this case there would be a number of interaction terms.
The sum of squares for methods × schools would be computed by
disregarding devices, and dealing only with 21 sets of pupils, each
set consisting of the two groups that used the same method but
different devices in a certain school. The sum of squares for these
sets corresponds to the sum of squares for “classes” in the problem
of Sections 4 to 6 of this chapter, and will here be referred to as the
sum of squares for methods within schools (notation: $ss_{M \text{ in } S}$).
This sum of squares would be

$$ss_{M \text{ in } S} = \frac{T_{1A}^2}{n_{1A}} + \frac{T_{1B}^2}{n_{1B}} + \cdots + \frac{T_{7C}^2}{n_{7C}} - \frac{T^2}{N}.$$

The sum of squares for methods × schools would then be

$$ss_{M \times S} = ss_{M \text{ in } S} - ss_M - ss_S.$$

The sum of squares for methods × devices would similarly be
found by disregarding schools, finding the sum of squares for
methods within devices (based on 6 sets of pupils, each set consisting
of the 7 groups which used the same combination of method and
device), and subtracting the sums of squares for methods and de-
vices. The sum of squares for methods within devices would be

$$ss_{M \text{ in } D} = \frac{T_{aA}^2}{n_{aA}} + \frac{T_{aB}^2}{n_{aB}} + \cdots + \frac{T_{bC}^2}{n_{bC}} - \frac{T^2}{N},$$

and for methods × devices would be

$$ss_{M \times D} = ss_{M \text{ in } D} - ss_M - ss_D.$$

The sum of squares for devices × schools would be found by sub-
tracting the sums of squares for devices and schools from the sum
of squares for devices within schools (based on 14 sets, each set
consisting of the 3 groups using the same device but different
methods in a certain school). The latter sum of squares would be

$$ss_{D \text{ in } S} = \frac{T_{1a}^2}{n_{1a}} + \frac{T_{1b}^2}{n_{1b}} + \cdots + \frac{T_{7b}^2}{n_{7b}} - \frac{T^2}{N},$$

and the sum of squares for devices × schools would be

$$ss_{D \times S} = ss_{D \text{ in } S} - ss_D - ss_S.$$

In this case there would also be a triple interaction whose sum of
squares would be the remainder left when all primary and double-
interaction sums of squares are subtracted from the sum of squares
for groups. The latter sum of squares (based on the 42 groups)
would be

$$ss_{\text{groups}} = \frac{T_{1aA}^2}{n_{1aA}} + \frac{T_{1aB}^2}{n_{1aB}} + \cdots + \frac{T_{7bC}^2}{n_{7bC}} - \frac{T^2}{N}.$$

The sum of squares for methods × devices × schools would then be

$$ss_{M \times D \times S} = ss_{\text{groups}} - ss_M - ss_D - ss_S - ss_{M \times S} - ss_{M \times D} - ss_{D \times S}.$$

The sum of squares for total ($ss_T$) is the sum of all squared scores
minus $T^2/N$, and for within groups is

$$ss_{\text{within groups}} = ss_T - ss_{\text{groups}}.$$

(This sum of squares within groups would not be meaningful and
would therefore not be computed unless the pupils had been as-
signed at random to the various groups in each school.)
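
The whole chain of subtractions above may be easier to follow in a short Python sketch. It is offered only as an illustration under stated assumptions: the function name `factorial_ss`, the dictionary layout, and the small data set are all hypothetical, and the groups are assumed (as in the text) to be of equal size within each school though not necessarily between schools.

```python
from itertools import product

def factorial_ss(groups):
    """groups maps (school, device, method) -> list of criterion scores."""
    scores = [y for ys in groups.values() for y in ys]
    N, T = len(scores), sum(scores)
    correction = T * T / N

    def ss_for(positions):
        # Pool the groups that share the same levels at `positions`,
        # then form the sum of (total squared / number) minus T^2/N.
        totals, counts = {}, {}
        for key, ys in groups.items():
            k = tuple(key[i] for i in positions)
            totals[k] = totals.get(k, 0.0) + sum(ys)
            counts[k] = counts.get(k, 0) + len(ys)
        return sum(totals[k] ** 2 / counts[k] for k in totals) - correction

    ss = {"S": ss_for([0]), "D": ss_for([1]), "M": ss_for([2])}
    ss["MxS"] = ss_for([0, 2]) - ss["M"] - ss["S"]   # methods within schools
    ss["MxD"] = ss_for([1, 2]) - ss["M"] - ss["D"]   # methods within devices
    ss["DxS"] = ss_for([0, 1]) - ss["D"] - ss["S"]   # devices within schools
    ss["groups"] = ss_for([0, 1, 2])
    ss["MxDxS"] = ss["groups"] - sum(
        ss[k] for k in ("M", "D", "S", "MxS", "MxD", "DxS"))
    ss["total"] = sum(y * y for y in scores) - correction
    ss["within"] = ss["total"] - ss["groups"]
    return ss

# Hypothetical data: 2 schools, 2 devices, 3 methods.
groups = {}
for s, d, m in product([1, 2], "ab", "ABC"):
    size = 2 if s == 1 else 3          # equal within a school only
    base = 10 * s + (3 if d == "b" else 0) + "ABC".index(m)
    groups[(s, d, m)] = [base + i for i in range(size)]

ss = factorial_ss(groups)
print(ss)
```

Only the group totals, the group sizes, and the sum of all squared scores are actually needed; the sketch accepts raw scores merely so that it is self-contained.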

The d.f.'s for methods, devices, and schools and for the double
interaction terms are found as before. The d.f. for methods within
schools (based on 21 pairs of groups) is 21 − 1 = 20. Hence, the
d.f. for M × S is 20 − 2 − 6 = 12. The d.f. for methods within
devices (based on 6 sets of groups) is 6 − 1 = 5, hence the d.f. for
M × D is 5 − 2 − 1 = 2. The d.f. for devices within schools (based
on 14 sets of groups) is 14 − 1 = 13. Hence the d.f. for D × S is
13 − 6 − 1 = 6. The d.f. for groups is 42 − 1 = 41, hence the d.f.
for M × D × S is 41 − 2 − 1 − 6 − 12 − 2 − 6 = 12. The d.f.
for total is 560 − 1 = 559. Hence, the d.f. for within groups is
559 − 41 = 518. These data ¹ may be arranged as follows:


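Source of variation     d.f.    Sum of squares        Variance

M                         2     ss_M                  ss_M / 2
D                         1     ss_D                  ss_D / 1
S                         6     ss_S                  ss_S / 6
M × D                     2     ss_M×D                ss_M×D / 2
M × S                    12     ss_M×S                ss_M×S / 12
D × S                     6     ss_D×S                ss_D×S / 6
M × D × S                12     ss_M×D×S              ss_M×D×S / 12
Within groups           518     ss_within groups      ss_within groups / 518
Total                   559     ss_T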
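
The d.f. bookkeeping can be checked with a few lines of Python, using the product rule for interactions rather than the subtractions. The variable names are illustrative only; the numbers of levels and of pupils are those of the example.

```python
# Levels of each factor in the example: 3 methods, 2 devices, 7 schools.
levels = {"M": 3, "D": 2, "S": 7}
n_pupils = 560

df = {factor: k - 1 for factor, k in levels.items()}
# The d.f. of an interaction is the product of the d.f.'s of its factors.
df["MxD"] = df["M"] * df["D"]
df["MxS"] = df["M"] * df["S"]
df["DxS"] = df["D"] * df["S"]
df["MxDxS"] = df["M"] * df["D"] * df["S"]

n_groups = levels["M"] * levels["D"] * levels["S"]   # 42 groups
df["within"] = n_pupils - n_groups
df["total"] = n_pupils - 1

for source, value in df.items():
    print(f"{source:8s}{value:5d}")
```

The components other than the total should add to the d.f. for total, which offers a quick check on the arithmetic.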


As in the simpler design of the preceding section, how we may
interpret the results depends upon whether or not we may assume
that there is no real interaction between methods and devices.
Our first step, then, will be to test M × D. In this case, however,
it is not sufficient to test M × D against within groups. This is
for the same reason that in the simpler methods-schools type of
experiment (Sections 4 to 6) it was insufficient to test methods
against within schools. In the latter type of experiment, we recog-
nized that while there might be real differences in the methods
means for individual schools, these differences might differ in mag-
nitude or even in direction from school to school, and that in the
whole population the general methods means might be equal.
We therefore found it appropriate to test methods against the inter-
action of methods with schools, and if the methods variance did not
prove significantly larger than the interaction (with schools)
variance, we recognized that the observed difference in general
methods means might be due to the chance selection of schools that
favored certain methods. Similarly, in the situation we are now
considering, it is possible that there is a real M × D interaction in
some schools, but that this interaction varies in intensity or even
in direction from school to school. Hence it is appropriate in this
case, before generalizing about all schools, to test M × D against
the interaction of M × D with schools, that is, to test M × D
against M × D × S. If this test does not prove significant, we
may retain the hypothesis that in the whole population of schools
the interactions of M × D within individual schools may in effect
counteract one another, with the result that the general population
means for methods may have the same relationship for both de-
vices, or the devices have the same relative effectiveness for all
methods.

1 It may be more convenient to think of the d.f. for M × D as the product of the
d.f.'s for M and D, and of the d.f. for M × S as the product of the d.f.'s for M and
S, etc. The d.f. for M × D × S is the product of the d.f.'s for M, D, and S.

In our example, the test of M × D against M × D × S is based on 2
and 12 d.f. If M × D does not prove significantly larger than
M × D × S, the hypothesis is tenable that there is no real over-
all M × D interaction, and we may proceed on this hypothesis
to test methods (and devices) on the basis of the total results. On
that hypothesis we may (as in Section 5) evaluate the variance for
methods against that for M × S, and that for devices against that
for D × S. It is conceivable, however, that there is no real inter-
action of schools with either M, D, or M × D, and that the ob-
served interactions involving schools are due to uncontrolled
variables (such as the teacher variable) or to chance alone. (See
pp. 110-111.) It is also conceivable, although generally improbable,
that the interactions of schools with M, D, and M × D are real but
of equal strength. In either case, the variances for M × S, D × S,
and M × D × S would differ only by chance, or would all be esti-
mates of the same true value. At any rate, if there are no appre-
ciable differences between the three interaction variances involving
schools (S), the hypothesis is tenable that they are fundamentally 
the same, and on this hypothesis a combination of the three of 
them will provide a better error estimate than any one alone. 
The usual procedure, therefore (before testing even M × D), is
to first examine the M × S, D × S, and M × D × S variances.
If they are all of approximately the same magnitude (no differences
significant at the 5 per cent level), the error variance is obtained by
adding the sums of squares for M × S, D × S, and M × D × S,
and dividing the total by the sum of the corresponding d.f.'s (in
this case 12 + 6 + 12 = 30). We would thus have

$$\text{error variance} = \frac{ss_{M \times S} + ss_{D \times S} + ss_{M \times D \times S}}{30}.$$

The interpretation of the results would then be exactly like that in
Section 11, with the difference that the conclusions would apply
to schools in general (in the specified population) rather than only
to a single school. If, however, large or significant differences
were observed between the M × S, D × S, and M × D × S
variances, we would evaluate M × D against M × D × S, and,
if M × D proved insignificant, would evaluate M against M × S
and D against D × S as suggested earlier.
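
The pooling of the three interactions involving schools may be sketched as follows. The sums of squares here are hypothetical figures chosen only to make the arithmetic easy to follow; the d.f. are those of the example, and the function name is illustrative. The preliminary judgment that the three variances are of about the same magnitude is assumed to have been made already.

```python
def pooled_error(ss, df, terms=("MxS", "DxS", "MxDxS")):
    """Pool the interactions involving schools into one error variance."""
    pooled_ss = sum(ss[t] for t in terms)
    pooled_df = sum(df[t] for t in terms)
    return pooled_ss / pooled_df, pooled_df

# Hypothetical sums of squares; d.f. as in the example of the text.
ss = {"MxD": 450.0, "MxS": 1200.0, "DxS": 660.0, "MxDxS": 1140.0}
df = {"MxD": 2, "MxS": 12, "DxS": 6, "MxDxS": 12}

error_variance, error_df = pooled_error(ss, df)
f_mxd = (ss["MxD"] / df["MxD"]) / error_variance
print(error_variance, error_df, f_mxd)
```

With these figures the pooled error variance is based on 30 d.f., and the M × D variance would be referred to the F table with 2 and 30 d.f.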

If a significant M × D interaction were found (or if the M × D
interaction were appreciably larger than error and the hypothesis
of no interaction were otherwise doubtful) we would test M and D
against M × D (just as we tested styles and sizes against size ×
style). If these tests prove insignificant, we would have to test the
differences for methods separately for each device, and those for
devices separately for each method, using as the error term either
M × D × S or the pooled error term described in the preceding
paragraph. Specifically, to evaluate the methods differences for
the first device, we would consider all groups using that device as
constituting a separate experiment (with 280 pupils) in which the
variance for methods could be computed as in Section 6 preceding.
For example, the sum of squares for methods with reference to de-
vice a only would be

$$\frac{T_{aA}^2 + T_{aB}^2 + T_{aC}^2}{n_{aA}} - \frac{T_a^2}{n_a},$$

which when divided by the number of d.f. (2) would give the vari- 
ance for methods within device a. This variance could then be 
tested against the error term derived from the whole experiment. 
A similar procedure would be followed to evaluate the methods 
differences in relation to the second device, and to evaluate the de- 
vices in relation to each of the methods. 

The factorial design may be extended, in theory at least, to any 
number of factors, and to any number of levels within each factor. 
Thus, we could experiment simultaneously, for example, with 
three methods, two motivating devices, and two sets of reading 
materials, duplicating the experiment in a number of schools. If 
the interactions involving S were all of approximately the same 
magnitude, the error term would then be the sum of all of these in- 
teractions, and the primary and interaction variances involving 
the other factors would be computed in the manner already ex- 
plained. However, as the number of factors and levels increases, 
the combinations may become so numerous that few schools 
could provide even one pupil for each combination, and hence 
in practice these more complex designs will perhaps seldom be 
employed, and need not be considered here. 


CHAPTER VI 
ANALYSIS OF COVARIANCE 


1. INTRODUCTORY
THE use of “matched” or equated groups to secure increased 
precision in methods experiments has long been widely practiced 
in educational research. (See Designs V and VI of Chapter IV.) 
This practice has usually resulted in a very worthwhile increase 
in precision, but often at the cost of considerable administrative 
inconvenience. If the equating of groups is done at the beginning 
of the experiment — as it should be — time must be taken to ad- 
minister an initial test, to score these tests, and to organize the 
equated groups before the experiment can get under way. The 
organization of the equated groups involves disrupting the classes 
as they are found already organized, and this again is often very 
inconvenient, if not quite impracticable. To avoid these difficul-
ties, the device has sometimes been employed of doing the “match-
ing” at the close of the experiment. The experiment is conducted
with the classes as they are found already organized in the school, 
and then, at the close of the experiment, the results for such pupils 
are discarded from the final analysis as is necessary to make the 
means and standard deviations of initial scores alike for all classes. 
This of course means a loss of valuable information, and this loss 
may sometimes offset any advantage gained by the use of equated 
groups. It is therefore fortunate for the educational experimenter 
that the methods of analysis of covariance — an extension by R. A.
Fisher of his methods of analysis of variance — now enable us to 
dispense with these inconvenient matching procedures and to se- 
cure the same increase in precision by the use of statistical controls. 


2. THE ESSENTIAL NATURE OF ANALYSIS OF COVARIANCE
The essential nature of the methods of analysis of covariance 
may perhaps best be made clear in terms of a concrete illustration.
We shall first consider the relatively simple case in which the
experiment is conducted in a single school, and in which the ex-
perimental groups are selected at random from the available pu-
pils. The more complex case in which the experiment is dupli-
cated in a number of schools will be considered later.

Suppose that in a certain school we have conducted an experi- 
mental comparison of three methods (A, B, and C) of teaching a 
given unit of content in seventh-grade arithmetic. Suppose 
that three equal experimental groups have been taught, one by 
each of the methods, and that the pupils were originally assigned to 
these groups strictly at random. Suppose that for each pupil we 
have a measure of initial ability in arithmetic secured at the be- 
ginning of the experiment, as well as a criterion measure of final 
achievement secured at the close of the experiment. 

Broadly stated, the hypothesis that we wish to test is that there 
are no real differences in methods, and that any differences in final 
mean scores of the methods groups, after allowances have been made 
for chance differences in initial mean scores, are due entirely to 
chance fluctuations in random sampling. This is not an exact 
statement of the null hypothesis, since we have not specified the 
manner in which the allowances for initial differences are to be 
made, but we will supply that deficiency later. The important 
point is that we hope through such allowances to attain the same 
precision as would have been attained had the groups been actually 
matched on the basis of the initial measures. 

The allowances for initial differences are to be made in terms of
the regression of final on initial measures. If we were actually to
compute the regression coefficient for the pupils in each group (or
under each method) separately, the result would of course differ
from group to group, but under our hypothesis these differences
would be due only to chance. In other words, according to our
hypothesis there is one true regression of final on initial measures
which is the same for all groups. This regression may be referred
to as the regression within groups. We shall see later how we may
secure a valid estimate of this regression. For the moment, let us
suppose that we have this estimate, and consider how it would be
used to “correct” the final means for initial differences.

The student will recall, from simple correlation theory, that if 
one knows the regression coefficient of Y on X for a sample, and 
knows also the amount x by which a given X deviates from the 
mean of the X’s, one can estimate the deviation of the correspond- 
ing Y from the mean of the Y's by multiplying the known x by 
the regression coefficient. Hence, given the regression coefficient 
and knowing the deviation of the initial score of any pupil from the 
initial general mean, we can compute the amount by which his 
final score would be expected to deviate from the final mean (“ex-
pected” because of his initial ability only, regardless of the method
by which he had been taught). If, then, we subtract ¹ this ex-
pected deviation (or correction) from his actual final score, we
should have an “adjusted” criterion score whose relative status
would be independent of the pupil’s initial ability. This adjusted
score may be defined as Y − bx, in which Y is the pupil's actual final
score, x the deviation of his initial score from the general mean, and
b the regression of final on initial scores. This adjusted score
would then be such that any two pupils, regardless of their initial 
scores, would have the same adjusted score if their actual final 
scores exceeded (or fell short of) their expected final scores by the 
same amount. 
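
A minimal Python sketch of this adjustment follows. The function name and the three pupils' scores are hypothetical, and the regression coefficient b is simply assumed to be known, as in the text at this point.

```python
def adjusted_scores(initial, final, b):
    """Return Y - b*x for each pupil, x being the deviation of the
    pupil's initial score from the general mean of initial scores."""
    mean_initial = sum(initial) / len(initial)
    return [y - b * (x0 - mean_initial) for x0, y in zip(initial, final)]

# Hypothetical pupils whose final scores all exceed expectation equally.
initial = [10, 20, 30]
final = [15, 25, 35]
adjusted = adjusted_scores(initial, final, b=1.0)
print(adjusted)
```

In this contrived case the three pupils differ in final score only because they differed in initial score, so their adjusted scores coincide; the mean of the adjusted scores is the same as the mean of the original final scores.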

Suppose, then, that we had thus computed an adjusted criterion 
score for each pupil in the experiment. (As we shall see later, we 
need not actually adjust the score of each individual pupil in order 
to compute the mean of the adjusted scores.) If we then com- 
puted the mean of these adjusted scores for the pupils under each 
method separately, these means would be independent of chance 
differences in mean initial ability of the experimental groups. 
These means of adjusted scores, or these adjusted means, should
then have the same relative magnitude as if the experimental
groups had been alike in initial ability, or had actually been
matched with reference to the initial measures. Finally, by ap-
plying the methods of analysis of variance to these adjusted scores,
we could test the hypothesis that the differences in adjusted meth-
ods means are due entirely to chance.

1 If the expected deviation were positive in sign, we would subtract it from the ac-
tual final score; if the correction were negative, its absolute value would be added to
the final score.

While the foregoing is not an exact indication of the specific pro- 
cedures involved in the methods of analysis of covariance, it should 
make clear to the student the essential nature of these methods. 
Essentially, the methods of analysis of covariance will enable us: 
(1) to estimate the true regression of final on initial measures (on 
the assumption that there is no real difference in regression from 
group to group or method to method), (2) to use this regression 
coefficient to correct or “adjust” the final methods means so as to 
allow for differences in the initial measures, and (3), to test the 
significance of the differences remaining in the adjusted methods 
means. The detailed steps in the procedure will be described
later, but before going on to their description, it might be well
first to consider the derivation of certain formulae which will be
required.


3. DERIVATION OF BASIC FORMULAS 
These formulae will be derived for a sample consisting of m
groups of n cases each. The notation employed will be as follows:

X represents a raw score on an initial test
Y represents a raw score on a final (criterion) test
$T_x$ and $T_y$ represent the sums of all initial and final scores
respectively, and $M_x$ and $M_y$ represent the corresponding
means.
$x = X - M_x$ and $y = Y - M_y$ represent deviations from the
general means for any individual.
$\bar{x} = \frac{\sum x}{n}$ and $\bar{y} = \frac{\sum y}{n}$ represent the means of x's and y's for
any group.

The correlation $r_{xy}$ for the total sample would then be

$$r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y},$$

or, for any single group it would be

$$r_{xy}\ (\text{within any group}) = \frac{\sum (x - \bar{x})(y - \bar{y}) / n}{\sigma_{(x - \bar{x})}\, \sigma_{(y - \bar{y})}}.$$

Again, if the correlation within the group were computed from
$M_x$ and $M_y$ as arbitrary origins, we could write

$$r_{xy}\ (\text{within any group}) = \frac{\sum xy / n - \bar{x}\bar{y}}{\sigma_{(x - \bar{x})}\, \sigma_{(y - \bar{y})}}.$$

Hence for any one group

$$\frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{\sum xy}{n} - \bar{x}\bar{y},$$

from which

$$\sum xy = \sum (x - \bar{x})(y - \bar{y}) + n\bar{x}\bar{y}.$$

Summing these expressions for all m groups we have

$$\sum\sum xy = \sum\sum (x - \bar{x})(y - \bar{y}) + n \sum \bar{x}\bar{y},$$

which may be more simply written

$$\sum xy = \sum (x - \bar{x})(y - \bar{y}) + n \sum \bar{x}\bar{y} \tag{19}$$

if we understand that the summation is for the total sample.

Thus we see that the total sum of the products (of deviations) 
may be analyzed into two components, just as the total sum of 
squares (of deviations) may be analyzed for either variable con- 
sidered alone. The components of the total sum of products (of 
deviations from the general mean) are the sum of the products of 
deviations from the group means and n times the sum of the prod-
ucts of the group means (each mean expressed as a deviation from
the general mean). 

The covariance of two variables for a sample is the mean of the 
products of their deviations from their means, just as the variance 
of a single variable is the mean of the squares of the deviations.
The best estimate of the covariance of a population that may be
derived from a sample of n cases is $\frac{\sum xy}{n - 1}$, and the best estimate of
the covariance of the means of such samples that may be derived
from m means is $\frac{\sum \bar{x}\bar{y}}{m - 1}$. The best estimate of the covariance of the
population that may be derived from the means of m samples of n
cases each is $\frac{n \sum \bar{x}\bar{y}}{m - 1}$. The argument supporting these statements
is of the same character as in the case of analysis of variance.

We have seen [formulas (17) and (18) on page 92] how $\sum x^2$ and
$\sum \bar{x}^2$ may be computed from the raw scores. We shall need similar
expressions for the computation of $\sum xy$ and $\sum \bar{x}\bar{y}$. We may note
first that

$$xy = (X - M_x)(Y - M_y) = XY - XM_y - YM_x + M_x M_y.$$

It follows that for a sample of N cases

$$\sum xy = \sum XY - M_y \sum X - M_x \sum Y + N M_x M_y.$$

Now, since each of the last three terms is equal (apart from sign) to
$\frac{T_x T_y}{N}$, two of them will cancel, leaving

$$\sum xy = \sum XY - \frac{T_x T_y}{N}. \tag{20}$$

Similarly, for m groups,

$$n \sum \bar{x}\bar{y} = T_{x_1} M_{y_1} + T_{x_2} M_{y_2} + \cdots + T_{x_m} M_{y_m} - T_x M_y,$$

from which, multiplying and dividing the right-hand terms by
$n_1$, $n_2$, etc., respectively, and noting that $M_{y_1} = T_{y_1}/n_1$, we get

$$n \sum \bar{x}\bar{y} = \frac{T_{x_1} T_{y_1}}{n_1} + \frac{T_{x_2} T_{y_2}}{n_2} + \cdots + \frac{T_{x_m} T_{y_m}}{n_m} - \frac{T_x T_y}{N}, \tag{21}$$

in which the numerical subscripts refer to the groups.
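
The raw-score formulas (20) and (21), and the analysis (19) of the total sum of products into its two components, can be checked numerically with a short Python sketch. The function name and the scores are hypothetical and serve only as an illustration.

```python
def sums_of_products(groups):
    """groups: list of (X_scores, Y_scores) pairs, one pair per group.
    Returns the total, between-groups, and within-groups sums of products."""
    X = [x for xs, _ in groups for x in xs]
    Y = [y for _, ys in groups for y in ys]
    N, Tx, Ty = len(X), sum(X), sum(Y)
    # Formula (20): sum of XY minus the product of grand totals over N.
    total = sum(x * y for x, y in zip(X, Y)) - Tx * Ty / N
    # Formula (21): group totals multiplied, divided by group size, summed.
    between = sum(sum(xs) * sum(ys) / len(xs) for xs, ys in groups) - Tx * Ty / N
    # Formula (19): the remainder is the sum of products within groups.
    within = total - between
    return total, between, within

# Hypothetical initial (X) and final (Y) scores for three groups of three.
groups = [([1, 2, 3], [2, 4, 7]),
          ([4, 5, 6], [5, 9, 10]),
          ([2, 4, 9], [3, 8, 12])]
total, between, within = sums_of_products(groups)
print(total, between, within)
```

The within-groups figure obtained by subtraction should agree with the sum of products of deviations taken directly about the several group means.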


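Let us now recall that $r_{xy}$ for the total sample may be written

$$r_{xy}\ (\text{total}) = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} \tag{22}$$

and that in the corresponding regression equation of Y on X the
regression coefficient ($b_{y \cdot x}$) may be written

$$b_{y \cdot x}\ (\text{total}) = r_{xy} \frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} \cdot \frac{\sqrt{\sum y^2 / N}}{\sqrt{\sum x^2 / N}} = \frac{\sum xy}{\sum x^2}. \tag{23}$$

The correlation and regression coefficients within any one group
may similarly be written

$$r_{xy}\ (\text{within any group}) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \cdot \sum (y - \bar{y})^2}} \tag{24}$$

and

$$b_{y \cdot x}\ (\text{within any group}) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}. \tag{25}$$
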
If we think of the summations as extending over all groups, ex-
pression (24) may be regarded as representing an “average”
correlation within groups for the total sample, and (25) as repre-
senting the “average” regression within groups. If, then, we let
Y − bx (b being the average $b_{y \cdot x}$ within
groups) represent what we have called an “adjusted” final score,
we may note that the mean of these adjusted scores for the total
sample will be

$$\frac{\sum (Y - bx)}{N} = \frac{\sum Y}{N} - \frac{b \sum x}{N} = M_y, \tag{26}$$

since $\sum x = 0$ and $\sum Y / N = M_y$. Furthermore, since the devia-
tion of a single adjusted score from the general mean of adjusted
scores is

$$(Y - bx) - M_y = y + M_y - bx - M_y = y - bx, \tag{27}$$

it follows that the mean of these deviations for any single group is


$$\frac{\sum (y - bx)}{n} = \frac{\sum y}{n} - \frac{b \sum x}{n} = \bar{y} - b\bar{x}, \tag{28}$$

which is also the deviation of the group mean of adjusted scores
from the general mean of adjusted scores. Accordingly, the sum
of the squares of these deviations for all m groups would be

$$\sum (\bar{y} - b\bar{x})^2 = \sum \bar{y}^2 - 2b \sum \bar{x}\bar{y} + b^2 \sum \bar{x}^2. \tag{29}$$

Now if we assume that all m groups are random samples from the
same population, our estimate [see (14), page 90] of the variance of
adjusted scores for the population would be $n \sum (\bar{y} - b\bar{x})^2 / (m - 1)$,
and the “sum of squares” used to compute this variance would be
$n \sum (\bar{y} - b\bar{x})^2$. From (29)

$$n \sum (\bar{y} - b\bar{x})^2 = n \sum \bar{y}^2 - 2bn \sum \bar{x}\bar{y} + b^2 n \sum \bar{x}^2. \tag{30}$$

We now note that the first right-hand term, $n \sum \bar{y}^2$, is the sum of
squares for variance between groups in an analysis of the variance
of the Y scores. Similarly $n \sum \bar{x}^2$ is the sum of squares for variance
between groups in an analysis of the X scores, and $n \sum \bar{x}\bar{y}$ is the ex-
pression whose computation is given by (21). From (30) we can
then compute the sum of squares for variance between groups for
the adjusted scores.

For any one group, the sum of the squared deviations of the ad-
justed scores from their mean for the group would be

$$\begin{aligned}
\sum [(y - \bar{y}) - b(x - \bar{x})]^2 &= \sum (y - \bar{y})^2 - 2b \sum (x - \bar{x})(y - \bar{y}) + b^2 \sum (x - \bar{x})^2 \\
&= \sum (y - \bar{y})^2 - 2\,\frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \sum (x - \bar{x})(y - \bar{y}) + \left[\frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}\right]^2 \sum (x - \bar{x})^2 \\
&= \sum (y - \bar{y})^2 - \frac{\left[\sum (x - \bar{x})(y - \bar{y})\right]^2}{\sum (x - \bar{x})^2}.
\end{aligned} \tag{31}$$

This expression (31) may also be used to represent the sum of
squares within groups for the total sample, again considering the
summations as extending over all groups.

We may now return to our illustration of the methods experi- 
ment. We had noted that if we could estimate the true regression 
of Y on X for all methods groups, we could compute an adjusted 
score for each individual such that the effect of differences in ini- 
tial ability would be eliminated from these adjusted scores. We 
have now seen (25) that we can compute this regression coefficient
if we have the sum of products and the sum of squares of initial
scores within groups. The needed sum of squares within groups,
$\sum (x - \bar{x})^2$, for initial scores may be found by analyzing the vari-
ance of the initial scores by the method of Section 2 of Chapter V.

The sum of products for the total sample would be secured,
according to (20), by subtracting from the sum of the XY products
the product of the grand totals divided by the total number of
cases. The sum of products for methods would be secured,
according to (21), by dividing the product of the initial and final
totals for each group by the number in the group, summing these
quotients, and subtracting the product of the grand totals divided
by the total number of cases.

We would then have the sums of products for total and for
methods to correspond to the sums of squares for total and for
methods for either initial or final scores. The sum of products
within groups, $\sum (x - \bar{x})(y - \bar{y})$, would then be found, according to
(19), by subtracting the sum of products for methods from the sum
of products for total, just as we secure the sum of squares within
groups by subtracting the sum of squares for methods from the sum
of squares for total. The ratio (25) between the sum of products
within groups (error) and the initial sum of squares within groups
would then be the regression coefficient needed to adjust the final
scores.

An analysis of the variance of the final scores would similarly 
yield a sum of squares within groups, $\sum (y - \bar{y})^2$, for final scores.
The sum of squares within groups for the adjusted scores could then
be computed from (31). This sum of squares may then be used
to compute the variance of adjusted scores within groups for the
total sample, which would represent the “adjusted” error variance
used in evaluating differences in final adjusted means for the 
methods. It is important to note, however, that this error vari- 
ance must be computed with one less d.f. than before, since one 
d.f. was utilized to compute the regression coefficient involved in 
(28). 

To evaluate the methods differences for adjusted scores, we must 
first find the adjusted variance for methods. We have already 
found the adjusted sum of squares within groups. If we can now 
find the adjusted sum of squares for the total sample, we can then 
secure the adjusted sum of squares for methods by subtraction. 

It should be noted that an adjusted sum of squares for methods 
(between groups) could be found, by means of (30), from the sums 
of squares for methods secured in the analyses of initial and final 
scores, and by computing b from (25) and $\sum xy$ from (21). This
sum of squares, however, would not be appropriate for testing the
significance of the methods differences. The adjusted sum of
squares for methods which is estimated from (30) is inflated by
sampling errors in the estimate of b which is utilized, and may
make the methods variance appear more significant than it really is.

To test the significance of the methods differences in adjusted
scores we therefore use a "reduced" estimate of the methods
variance, which is derived as follows: we first compute the total
sum of squares that would have been found for adjusted scores had
the regression coefficient used in the adjustment been that derived
from the total sample. In this case the b used would be that of
(23), and the sum of squares of deviations from the general mean
for the adjusted scores (Y - bx) would be

$$
\begin{aligned}
\sum \left[ (Y - bx) - M_y \right]^2 &= \sum (y + M_y - bx - M_y)^2 \\
&= \sum (y - bx)^2 \\
&= \sum y^2 - 2b \sum xy + b^2 \sum x^2 \\
&= \sum y^2 - 2\,\frac{(\sum xy)^2}{\sum x^2} + \frac{(\sum xy)^2}{\sum x^2} \\
&= \sum y^2 - \frac{(\sum xy)^2}{\sum x^2} \qquad (32)
\end{aligned}
$$


Having calculated the total sum of squares for adjusted scores 
from (32), we subtract the sum of squares within groups computed
from (31), to secure a "reduced" sum of squares for methods.
We then compute a reduced variance for methods from this reduced 
sum of squares, and then apply the F-test to the ratio between this 
reduced variance for methods and the error-variance for the ad- 
justed scores. 
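
As a numerical check on the algebra leading to (32), the following Python sketch compares the sum of squared adjusted deviations, computed score by score, with the shortcut form. The paired scores are hypothetical and the function name is an assumption made for illustration only.

```python
def adjusted_sum_of_squares(ss_y, sp_xy, ss_x):
    """Shortcut of the form used in (31) and (32): sum y^2 - (sum xy)^2 / sum x^2."""
    return ss_y - sp_xy ** 2 / ss_x


# Hypothetical paired scores, used only to check the identity.
X = [12, 15, 9, 20, 14, 17]
Y = [30, 34, 25, 41, 29, 38]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
x = [v - mean_x for v in X]          # deviations from the general mean
y = [v - mean_y for v in Y]

ss_x = sum(v * v for v in x)
ss_y = sum(v * v for v in y)
sp_xy = sum(a * c for a, c in zip(x, y))

b = sp_xy / ss_x                     # regression coefficient from the total sample
direct = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
shortcut = adjusted_sum_of_squares(ss_y, sp_xy, ss_x)
```

The two quantities `direct` and `shortcut` agree to within rounding error. The same function serves for the within-groups term when the within-groups sums are supplied.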


If this test indicated that there are real differences between
methods, we could proceed to compute each adjusted methods
mean by finding the deviation of the initial mean for that method
from the general initial mean, multiplying this deviation by the b
computed from (25), and subtracting the result from the final
mean for that method. In other words, each methods mean would
be adjusted in the same way we would adjust a single score. We
[...] individual. The only [...]
for analyses of variance [...]
products (ΣXY) of i[...]


Even so, the student is hardly to be blamed if at this point he 
considers analysis of covariance as extremely complicated. Much 
of this apparent complexity has perhaps resulted from the writer’s 
attempt to simplify the discussion by showing every step of the 
derivations, in order that they may be followed by one not skillful
in mathematics. The application of the end results of these
derivations in an actual computational problem is not at all
difficult, as the following illustration will show.

4. ANALYSIS OF COVARIANCE IN A SIMPLE METHODS EXPERIMENT 

Suppose that a methods experiment involving three methods
has been performed in a single school, and, for the sake of
simplicity of illustration, that only 12 pupils were involved in the
experiment. Suppose that the pupils were assigned at random to 3
groups of 4 pupils each, one group for each method. We will use
the letters A, B, and C to refer both to the methods and the
corresponding groups. Suppose, finally, that measures of achievement
in the subject taught (using different tests) were secured both
at the beginning and the end of the experiment, and that the scores
on the initial (X) and final (Y) tests were as follows:


The steps in the analysis are as follows:

1. Find the sums of initial scores, squared initial scores, final
scores, and squared final scores, and the sum of products of initial
and final scores for the entire sample.
The results for the illustrative problem are as follows:
    ΣX  = 360        ΣY  = 264
    ΣX² = 13748      ΣY² = 7454
    ΣXY = 9832
(These five terms may all be secured in a single operation on
an automatic Monroe computing machine.)

2. Compute the total and mean for each group for initial and final
scores separately.
Results for illustrative problem:

                 Initial Scores        Final Scores
                 Total    Mean         Total    Mean
    Method A      146     36.5           98     24.5
    Method B      148     37.0          110     27.5
    Method C       66     16.5           56     14.0

3. Analyze the total "sum of squares" for initial scores into the
METHODS and WITHIN GROUPS components, following the procedure of
Section 2 of Chapter V. Do the same for the final scores.
Results for illustrative problem:

                      Sums of Squares
                      Initial Scores    Final Scores
    Methods                1094              402
    Within Groups          1854             1244
    Total                  2948             1646

4. Compute the total sum of products [according to (20)].
    ΣXY - GT_x · GM_y = 9832 - (360)(22.0) = 1912

5. Compute the sum of products for METHODS [according to (21)].
    (146)(24.5) + (148)(27.5) + (66)(14.0) - 7920 = 651

6. Subtract the sum of products for METHODS from that for TOTAL
to secure the sum of products WITHIN GROUPS (error).
    1912 - 651 = 1261

7. Compute the adjusted sum of squares WITHIN GROUPS [according
to (31)].
    1244 - (1261)²/1854 = 386.33

8. Compute the adjusted total sum of squares [according to (32)].
    1646 - (1912)²/2948 = 405.92

9. Subtract the adjusted sum of squares WITHIN GROUPS from that
for TOTAL to secure the reduced sum of squares for METHODS.
    405.92 - 386.33 = 19.59

10. Compute the reduced variance for METHODS.
    19.59/2 = 9.80

11. Compute the adjusted error (WITHIN GROUPS) variance, noting
that the d.f. is one less than for the WITHIN GROUPS variance of
either initial or final scores. (One d.f. is used in computing
the regression coefficient from the within groups sums of
squares and products.)
    386.33/8 = 48.29

12. Divide the reduced methods variance by the adjusted error
variance to secure the F used to test the adjusted methods differences.
    9.80/48.29 = .20

Since the F is less than 1, it is obvious that the methods
differences are not significant.
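
The twelve steps can be retraced in Python from the summary figures quoted in the illustration (the five sums of Step 1, the group totals for initial scores, and the group means for final scores). This is only a sketch of the arithmetic; the variable names are illustrative and are not notation from the text.

```python
n_groups, n_per_group = 3, 4
N = n_groups * n_per_group

# Step 1: sums for the entire sample.
sum_x, sum_y = 360, 264
sum_x2, sum_y2, sum_xy = 13748, 7454, 9832

# Step 2: group totals (initial) and group means (final).
total_x = {"A": 146, "B": 148, "C": 66}
mean_y = {"A": 24.5, "B": 27.5, "C": 14.0}

# Step 3: sums of squares, total and methods; within groups by subtraction.
ss_x_total = sum_x2 - sum_x ** 2 / N
ss_y_total = sum_y2 - sum_y ** 2 / N
ss_x_methods = sum(t ** 2 for t in total_x.values()) / n_per_group - sum_x ** 2 / N
ss_y_methods = (
    sum((n_per_group * m) ** 2 for m in mean_y.values()) / n_per_group
    - sum_y ** 2 / N
)
ss_x_within = ss_x_total - ss_x_methods
ss_y_within = ss_y_total - ss_y_methods

# Steps 4 to 6: sums of products.
correction = sum_x * sum_y / N
sp_total = sum_xy - correction
sp_methods = sum(total_x[g] * mean_y[g] for g in total_x) - correction
sp_within = sp_total - sp_methods

# Steps 7 to 9: adjusted sums of squares and the reduced methods term.
adj_within = ss_y_within - sp_within ** 2 / ss_x_within
adj_total = ss_y_total - sp_total ** 2 / ss_x_total
reduced_methods = adj_total - adj_within

# Steps 10 to 12: variances and F (one d.f. lost from the error term).
df_methods = n_groups - 1
df_error = N - n_groups - 1
F = (reduced_methods / df_methods) / (adj_within / df_error)
```

Rounded to two places, `adj_within`, `adj_total` and `F` reproduce the 386.33, 405.92 and .20 of the worked steps.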


It is instructive to note that the error variance for the unadjusted
final score is 1244/9 = 138.22. Hence, through analysis of
covariance we have reduced the error variance from 138.22 to 48.29,
or almost tripled the precision of the experiment.

The extent to which the use of the methods of analysis of covariance
increases the precision of an experiment of this type depends
upon the within groups correlation between initial and final scores.
The ratio between the adjusted error variance and the unadjusted
error variance is very nearly equal ¹ to (1 - r²), r being the within
groups correlation. In the illustration here used, according to
(24),

$$ r_{xy} \text{ (within groups)} = \frac{1261}{\sqrt{(1854)(1244)}} = .83. $$

The ratio between the adjusted and unadjusted error variances is
48.29/138.22 = .35, which is very nearly equal to (1 - .83²). The
correlation of .83 which is found in this example is of course higher
than would be found in most actual methods experiments. Ordinarily,
the within groups correlation between initial and final scores will
not exceed .70, hence the use of analysis of covariance will seldom
more than double the precision of the experiment.

¹ It would be exactly equal if it were not for the loss of 1 d.f. for
the adjusted error variance.

It will also be instructive to compute the adjusted methods
means in the illustrative problem. To do this, we must first find
the value of b. From (25),

$$ b = \frac{1261}{1854} = .6802. $$

The initial mean for Method A deviated from the general mean by
36.5 - 30.0 = 6.5. Hence the adjusted final mean for Method A
is
    24.5 - (.6802)(6.5) = 20.08.
Similarly the adjusted mean for Method B is
    27.5 - (.6802)(7.0) = 22.74
and for Method C is
    14.0 - (.6802)(-13.5) = 23.17.
Thus we see that the differences between the methods means for
adjusted scores are very much less than [...]
the C group. After adjustment, the C mean is higher than the
others, where before it had been much lower.
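
The within groups correlation, the ratio of error variances, and the adjusted means can all be checked from the within groups sums and the group means of the illustration. The sketch below uses Python's standard `math` module; the variable names are illustrative. It also shows the exact relation behind the "very nearly equal" statement: the ratio of the variances is (1 - r²) multiplied by 9/8, the factor arising from the lost degree of freedom.

```python
import math

sp_within, ss_x_within, ss_y_within = 1261, 1854, 1244
df_within = 9

r = sp_within / math.sqrt(ss_x_within * ss_y_within)          # about .83

unadjusted_var = ss_y_within / df_within                       # 1244/9
adjusted_var = (ss_y_within - sp_within ** 2 / ss_x_within) / (df_within - 1)
ratio = adjusted_var / unadjusted_var                          # about .35

# Exact form of the relation: the d.f. factor accounts for the discrepancy.
exact = (1 - r ** 2) * df_within / (df_within - 1)

# Adjusted methods means: final mean minus b times the initial deviation.
b = sp_within / ss_x_within
initial_mean = {"A": 36.5, "B": 37.0, "C": 16.5}
final_mean = {"A": 24.5, "B": 27.5, "C": 14.0}
general_initial_mean = 30.0
adjusted_mean = {
    g: final_mean[g] - b * (initial_mean[g] - general_initial_mean)
    for g in final_mean
}
```

Carried to full precision, the adjusted means come out at about 20.08, 22.74 and 23.18; the last digit of such figures may differ by one from a hand computation with a rounded b.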

If the differences between methods had proved significant, we
would have wished to compute the standard error of a methods
mean and the standard error of the difference between two means
to evaluate the individual differences between adjusted methods
means. The standard error * of an adjusted mean would be the
square root of the adjusted error-variance divided by 4, or

$$ \sqrt{\frac{48.29}{4}} = 3.49. $$

The d.f. for this standard error is the same as for the adjusted error
variance, or 8. The standard error of a difference between adjusted
methods means is (1.414)(3.49) = 4.92. Since for 8 degrees of
freedom a t of 3.36 would be required at the 1 per cent
level of significance, a difference between adjusted methods means
would have to be larger than 16.5 to be significant at that level.
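
Both ways of obtaining the standard error of a difference (the approximate one of this paragraph and the exact one of its footnote) can be compared in a short Python sketch. The function name `se_difference_exact` is an illustrative assumption; the numbers are those of the illustration. Carried to full precision the standard error of a mean is about 3.47 and the approximate standard error of a difference about 4.91, a shade below the rounded figures printed above.

```python
import math

adjusted_error_var = 48.29      # adjusted within groups variance, 8 d.f.
n_per_group = 4
ss_x_within = 1854              # initial sum of squares within groups
initial_mean = {"A": 36.5, "B": 37.0, "C": 16.5}

se_mean = math.sqrt(adjusted_error_var / n_per_group)
se_difference_approx = math.sqrt(2) * se_mean


def se_difference_exact(g1, g2):
    """Square root of s^2 [2/n + (difference of initial means)^2 / sum x^2]."""
    d = initial_mean[g1] - initial_mean[g2]
    return math.sqrt(adjusted_error_var * (2 / n_per_group + d ** 2 / ss_x_within))


t_one_per_cent = 3.36           # value for 8 d.f. quoted in the text
least_significant_difference = t_one_per_cent * se_difference_approx
```

For A and B, whose initial means are nearly equal, the two computations agree closely; for the comparisons involving C the exact value is noticeably larger.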

It may be well, finally, to remind ourselves of the various
assumptions (some of which have not heretofore been stated explicitly)
that are involved in an analysis of the type just illustrated.
They are:

1) That the methods groups were selected at random from the
same population.

2) That the distribution of adjusted scores within groups is
fundamentally normal.

3) That the groups are homogeneous in variability.

4) That the regression of final on initial scores is fundamentally
the same from group to group.

5) That the regression is linear.

* This method of computing the standard error of the difference
between two adjusted means is not quite correct, although near
enough for most practical purposes. Actually, the variance of the
difference between the adjusted final means is given by

$$ s^2 \left[ \frac{2}{n} + \frac{(\bar{X}_1 - \bar{X}_2)^2}{\sum x^2} \right], $$

in which $s^2$ is the adjusted error variance, n is the number of
pupils per class, $\bar{X}_1$ and $\bar{X}_2$ are the initial means of
the two methods compared, and $\sum x^2$ is the initial sum of squares
within groups. For methods A and B this result is

$$ \left[ \frac{2}{4} + \frac{(0.5)^2}{1854} \right] 48.29 = 24.15. $$

The standard error of the difference is the square root of this, or
4.91, which is almost the same as the result (4.92) computed above.
For methods A and C, however, the standard error of the difference
thus computed is 5.87, and for B and C is 5.92. The two methods of
computing the standard error of the difference will obviously yield
nearly the same result if the corresponding initial means are close
enough, as they ordinarily would be if the groups were randomly
selected, but may differ markedly if the groups are not so selected.


5. ANALYSIS OF COVARIANCE IN DUPLICATE EXPERIMENTS IN 
RANDOMLY SELECTED SCHOOLS 

We may now see how the methods of analysis of covariance 
may be applied in a more complex experimental design involving 
duplications, in each of a number of different schools, of an experi- 
ment of the type just considered. The procedure in this case will 
be much the same as before, the principal difference being that
our error estimate will be based on the interaction (M × S)
variance, rather than on the within groups variance of the
adjusted scores. The "adjustment" of the criterion scores will
accordingly be based on a regression coefficient derived from the
sums of squares and products for M × S, rather than for within
groups.


There will also be certain changes in the assumptions involved.
In the problem of the preceding section we assumed that the true
regression of final on initial scores was the same from one methods
group to another, and that this regression was linear. We shall
now have to assume that the true regression of final on initial
class means, with methods and school differences eliminated, is the
same from method to method, and is linear. That is, we assume
that after the class means for both initial and final scores have
been "corrected" (see page 109) so as to eliminate methods and
school differences, the true regression of final on initial values
(weighted) of these corrected means is the same from method to
method. In the problem of the preceding section, we assumed
also that the pupils were assigned at random to the various methods
groups. This assumption is not now necessary, but we must
assume that the classes in each school were assigned at random to
the methods. (The exception would be if we wished to test the
adjusted interaction variance itself, in which case we would have
to randomize the pupils.)

We will first illustrate the computational procedures in a con- 
crete example in which our only objective is to test the significance 
of the differences between methods, using the interaction (M X S) 
term as the error term. Later we will consider the modifications 
in procedure that would be necessary if we wished to test the 
hypothesis that there is no real interaction of methods and schools. 

The data used in this example were obtained from an experimental
comparison of two methods of improving punctuation ability at the
fifth-grade level. The initial measure (X) was the score on a general
English usage test administered at the beginning of the experiment.
The criterion measure (Y) was the score on a punctuation test
administered at the close of the experiment. The experiment was
planned as in the example of the preceding section. In each of five
schools one of two classes of equal size was taught by Method A, the
other by Method B, for a period of 18 weeks. (The scores of all pupils
on the initial and final tests are given in Table 12, on page 198. The
student is urged to check all computations as an exercise.)

The steps in the analysis are as follows:

1. For both initial and final scores, compute the total and mean for
each class, for each school, and for each method.

For the example:

TOTALS AND MEANS OF INITIAL SCORES

               Method A           Method B           Schools          Number
             Totals   Means     Totals   Means     Totals   Means     Pupils
School 1      2269   70.9063     1995   62.3438     4264   66.6250       64
School 2      1832   96.4211     1698   89.3684     3530   92.8947       38
School 3       824   54.9333      859   57.2667     1683   56.1000       30
School 4       839   76.2727      731   66.4545     1570   71.3636       22
School 5       414   59.1429      424   60.5714      838   59.8571       14
Methods       6178   73.5476     5707   67.9405    11885 = GT_x       N = 168

GM_x = 70.7441      GT_x · GM_x = 840,794      ΣX² = 910289


TABLE 12

SCORES ON INITIAL (X) AND FINAL (Y) TESTS FOR PUPILS WHO STUDIED
UNDER METHODS A AND B IN AN EXPERIMENT IN FIFTH-GRADE LANGUAGE
INSTRUCTION

School r School 2 9 12 45 38 

A B A B 66 29 54 39 

M т x Ë x т xX F 68 то 56 39 

64 41 77 50 66 43 90 68 : 48 " Ет 

55 41 69 59 94 74 99 84 95 65 39 33 

77] 64 82 71 85 33 93 74 6: 42 47 35 
71 42 67 53 107 59 95 50 


A 


School 4 
B 
Y 


64 
72 
56 
TOTALS AND MEANS OF FINAL SCORES

               Method A           Method B           Schools
             Totals   Means     Totals   Means     Totals   Means
School 1      1526   47.6875     1543   48.2188     3069   47.9531
School 2      1077   56.6842     1350   71.0526     2427   63.8684
School 3       485   32.3333      622   41.4667     1107   36.9000
School 4       651   59.1818      605   55.0000     1256   57.0909
School 5       257   36.7143      358   51.1429      615   43.9286
Methods       3996   47.5714     4478   53.3095     8474 = GT_y       N = 168

GM_y = 50.4405      GT_y · GM_y = 427,432.8      ΣY² = 477432


2. For the initial and final scores separately, find the sums of
squares for M and for M × S.
(Follow the procedure of steps 1 to 7 on pages 120-122.)

              Σx²        Σy²
M            1319.5     1382.5
M × S         904.7     2033.9


3. Compute the sums of products for M, S, and CLASSES.
[According to (21), page 185.]
Sum of products for methods equals:
    [(6178)(47.5714) + (5707)(53.3095)] - 599486 = -1352.6
Sum of products for schools equals:
    [(4264)(47.9531) + (3530)(63.8684) + (1683)(36.9000)
    + (1570)(57.0909) + (838)(43.9286)] - 599486 = 18989.7
Sum of products for classes equals:
    [(2269)(47.6875) + (1832)(56.6842) + (824)(32.3333)
    + (839)(59.1818) + (414)(36.7143) + (1995)(48.2188)
    + (1698)(71.0526) + (859)(41.4667) + (731)(55.0000)
    + (424)(51.1429)] - 599486 = 18411.6
The sum of products for methods is negative, as we should expect
from the fact that the initial mean is higher for A than for B, while
the final mean is higher for B. In other words, the initial and
final methods means are negatively correlated.

4. Subtract the sum of products for M and for S from that for classes
to secure the sum of products for M × S.
    18411.6 - 18989.7 - (-1352.6) = 774.5
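
Steps 3 and 4 can be retraced in Python from the class totals and class sizes of the two tables of totals and means. The sketch works from totals rather than from means rounded to four decimals (each total multiplied by a mean is the same as the product of the two totals divided by the number of pupils), and it uses the unrounded correction term. Its results therefore differ from the hand-computed figures above by about one unit. The function name and the layout of `classes` are illustrative assumptions.

```python
# (initial total, final total, number of pupils) for each class.
classes = {
    (1, "A"): (2269, 1526, 32), (1, "B"): (1995, 1543, 32),
    (2, "A"): (1832, 1077, 19), (2, "B"): (1698, 1350, 19),
    (3, "A"): (824, 485, 15),   (3, "B"): (859, 622, 15),
    (4, "A"): (839, 651, 11),   (4, "B"): (731, 605, 11),
    (5, "A"): (414, 257, 7),    (5, "B"): (424, 358, 7),
}


def sum_of_products(cells, key):
    """Pool the cells by key(cell); return sum(Tx * Ty / n) minus the correction."""
    pooled = {}
    for cell, (tx, ty, n) in cells.items():
        px, py, pn = pooled.get(key(cell), (0, 0, 0))
        pooled[key(cell)] = (px + tx, py + ty, pn + n)
    gt_x = sum(tx for tx, _, _ in cells.values())
    gt_y = sum(ty for _, ty, _ in cells.values())
    n_total = sum(n for _, _, n in cells.values())
    correction = gt_x * gt_y / n_total
    return sum(tx * ty / n for tx, ty, n in pooled.values()) - correction


sp_methods = sum_of_products(classes, key=lambda cell: cell[1])
sp_schools = sum_of_products(classes, key=lambda cell: cell[0])
sp_classes = sum_of_products(classes, key=lambda cell: cell)

# Step 4: the interaction term is what remains of the classes term.
sp_interaction = sp_classes - sp_schools - sp_methods
```

The sign of `sp_methods` is negative, in agreement with the remark that the initial and final methods means are negatively correlated.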
5. Arrange the sums of squares and products for M and M × S in
tabular form, and add the M and M × S terms in each column.

                Σx²        Σxy        Σy²
M              1319.5    -1352.9     1382.5
M × S           904.7      774.5     2033.9
M + M × S      2224.2     -578.4     3416.4

6. Compute the adjusted sum of squares for M × S.
[According to (32), page 190.]
    Σy² - (Σxy)²/Σx² (for M × S) = 2033.9 - (774.5)²/904.7 = 1370.9

7. Compute the adjusted sum of squares for M + M × S.
[According to (32), page 190.]
    3416.4 - (-578.4)²/2224.2 = 3266.0

8. Subtract the adjusted sum of squares for M × S from that for
M + M × S to get the reduced sum of squares for M.
    3266.0 - 1370.9 = 1895.1

9. Divide the reduced sum of squares for M by the d.f. for M to
secure the reduced METHODS variance.
    1895.1 ÷ 1 = 1895.1

10. Divide the adjusted sum of squares for M × S by its d.f. to secure
the adjusted M × S (error) variance, noting that the d.f. for the
adjusted M × S variance is one less than for the unadjusted M × S
variance. (One d.f. having been used in the computation of
the regression coefficient from the M × S terms.)
    1370.9 ÷ 3 = 457.0

11. Divide the reduced variance for METHODS by the adjusted M × S
variance. The result is the F used to test the significance of the
adjusted methods differences.
    F = 1895.1/457.0 = 4.15

For 1 and 3 d.f. an F of 10.13 is required for significance at the
5 per cent level. Hence we cannot, in this case, reject the null
hypothesis for the methods differences.
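
Steps 5 to 11 reduce to a few lines once the sums of squares and products for M and for M × S are in hand. The Python sketch below takes the figures from the table of Step 5; the function and variable names are illustrative only.

```python
def adjusted_ss(ss_x, sp_xy, ss_y):
    """Sum y^2 - (sum xy)^2 / sum x^2, the form used in (31) and (32)."""
    return ss_y - sp_xy ** 2 / ss_x


# (sum x^2, sum xy, sum y^2) as tabulated in Step 5.
methods = (1319.5, -1352.9, 1382.5)
interaction = (904.7, 774.5, 2033.9)
combined = tuple(m + i for m, i in zip(methods, interaction))

adj_interaction = adjusted_ss(*interaction)        # Step 6
adj_combined = adjusted_ss(*combined)              # Step 7
reduced_methods = adj_combined - adj_interaction   # Step 8

df_methods = 1                                     # two methods
df_error = 4 - 1                                   # M x S d.f. less one for b
F = (reduced_methods / df_methods) / (adj_interaction / df_error)

required_F_5_per_cent = 10.13                      # for 1 and 3 d.f., from the text
significant = F > required_F_5_per_cent
```

Here `F` comes out at about 4.15 and `significant` is false, in agreement with the conclusion drawn above.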

The foregoing test of significance would ordinarily conclude the 
analysis in an experiment of the type illustrated. However, it 
may sometimes be desirable to test the hypothesis that there is no 
real interaction of methods and schools. This hypothesis could 
be tested only if the pupils had been randomly assigned to the 
methods groups in each school, and then only on the assumption 
that the true regression of final on initial scores was the same, not 
only from group to group within the same school, but also from 
school to school. It may be well to show then, in the case of our
example, how to test the hypothesis that there is no real M × S
interaction when the scores are adjusted for differences in initial
ability.

Under this hypothesis, the scores would be adjusted on the basis
of the regression within classes, just as was the case in the example
on pages 191 ff. Hence we would have to complete the analysis
of initial and final scores begun in Step 4 preceding, so as to secure
the sums of squares within classes. We should also have to compute
the sums of products for within classes, by computing first
the sum of products for classes [see (21)] and for the total sample
[see (20)] and subtracting the former from the latter. For the
example, the sums of squares and products for M × S and within
classes are

                              Σx²         Σxy         Σy²
M × S                         904.7       774.5      2033.9
Within Classes              39440.1     23915.4     32268.3
M × S + Within Classes      40344.8     24689.9     34302.2


We would then find the adjusted sums of squares for within
classes and for M × S + within classes according to (31) and
(32), and subtract the result for within classes to secure the
"reduced" sum of squares for M × S. In the example, the results
are

    Adjusted sum of squares for M × S + within classes = 19192.7
    Adjusted sum of squares for within classes         = 17767.1
    Reduced sum of squares for M × S                   =  1425.6

The reduced variance for M × S is accordingly 1425.6 ÷ 4 = 356.4,
and the adjusted variance for within classes is 17767.1 ÷ (158 - 1)
= 113.2. The ratio between these variances is 356.4/113.2 = 3.15.
Since, for 4 and 150 d.f., the F required for significance at the 5 per
cent level is 2.43, and at the 1 per cent level is 3.44, this F of 3.15
constitutes very convincing evidence that there is a real interaction
effect. It was particularly appropriate, therefore, that we earlier
used the interaction variance as the error term.
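
The test of the interaction follows the same pattern as the test of methods, with within classes now serving as the error term. The following Python sketch starts from the sums of squares and products for M × S and within classes (the within classes sum of products is taken as 23915.4, the value that makes the column totals agree). Names are illustrative; because the quoted sums are rounded, the adjusted within classes sum of squares differs from the 17767.1 given above in the last digits.

```python
def adjusted_ss(ss_x, sp_xy, ss_y):
    """Sum y^2 - (sum xy)^2 / sum x^2, the form used in (31) and (32)."""
    return ss_y - sp_xy ** 2 / ss_x


# (sum x^2, sum xy, sum y^2)
interaction = (904.7, 774.5, 2033.9)
within_classes = (39440.1, 23915.4, 32268.3)
combined = tuple(i + w for i, w in zip(interaction, within_classes))

adj_within = adjusted_ss(*within_classes)
adj_combined = adjusted_ss(*combined)
reduced_interaction = adj_combined - adj_within

df_interaction = 4            # (2 - 1)(5 - 1)
df_within = 158 - 1           # 168 pupils less 10 classes, less one for b
F = (reduced_interaction / df_interaction) / (adj_within / df_within)
```

The resulting `F` of about 3.15 lies between the 5 per cent and 1 per cent values quoted for 4 and 150 d.f.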

It is interesting to note, from the data in Step 4 preceding, that
the M × S or error variance for the unadjusted final scores was
2033.9/4 = 508.5, as compared to 1370.9/3 = 457.0 for the adjusted
scores. The increase in precision due to the use of analysis
of covariance was in this case slight, amounting to only about a
10 per cent increase. It is significant, however, that the reduction
in the sum of squares was more pronounced (from 2033.9 to 1370.9),
but that this advantage was dissipated by the loss in d.f. (from
4 to 3). The loss of 1 d.f. was in this case very serious, since we
had only 4 d.f. for M × S originally. In general, therefore, it
would be best to use a sufficient number of schools so that the
effect of the loss of 1 d.f. in computing the regression coefficient
would be more nearly negligible.

It is difficult, for an experiment of this type, to predict the
efficacy of the methods of analysis of covariance, even though one
can anticipate accurately the correlation of initial and final scores.
In a simple experiment of the type illustrated on pages 107 ff., the
increase in precision due to the use of statistical controls depends
only upon this correlation. (In that case the ratio of the adjusted
error variance to the unadjusted error variance is very
nearly equal to (1 - r²), r being the correlation within classes for
the initial and final scores.) In the case of the present design,
however, in which the interaction variance is used as the error
term, the increase in precision depends upon the correlation of the
interaction effects (upon class means) for initial and final scores.
This correlation may be computed from the sums of squares and
products for M × S by means of (24). In the case of the illustrative
problem it is 774.5/√(904.7 × 2033.9) = .57. If there is
no real interaction, this correlation will tend to be the same as the
correlation within classes. If there is a real interaction, the effect
may be either to decrease or increase the correlation of interaction
effects. In the illustration, the correlation within classes is
23914.8/√(39440.1 × 32268.3) = .66. The difference between the
correlations based on the M × S and within classes terms could be due to
chance, but it is more probably due to the significant interaction
effect. In general, the effect of the interaction would be to make
the M × S correlation smaller than the correlation within classes,
and hence would tend to make the use of analysis of covariance
less profitable.

It is interesting that in the illustrative problem the variance
ratio (F) for methods and M × S was 4.15 for the adjusted scores,
but only 2.72 for the unadjusted scores. The change in the F's,
then, was much larger than the change in the M × S (error)
variances. It just happened in this case that there was an unusually
large chance difference in initial means in favor of the A
group, while the final difference favored the B group. The
adjusted difference in final methods means was consequently larger
than the unadjusted difference. In general, in experiments of this
type, the difference in final means would tend to be in the same
direction as the initial difference, hence more frequently the
adjusted difference in final means would be smaller than the
unadjusted difference, and the adjusted methods variance would
be less than the unadjusted methods variance of final scores.
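
The comparisons made in the last three paragraphs (unadjusted against adjusted error variance, the correlation of interaction effects, and the unadjusted F) can be verified with a short Python sketch using the M and M × S sums already tabulated. Variable names are illustrative.

```python
import math

# Sums of squares and products for M x S, and the methods sum of squares
# for final scores, as given in the illustrative problem.
ss_x_mxs, sp_mxs, ss_y_mxs = 904.7, 774.5, 2033.9
ss_y_methods = 1382.5

unadjusted_error_var = ss_y_mxs / 4                          # about 508.5
adjusted_error_var = (ss_y_mxs - sp_mxs ** 2 / ss_x_mxs) / 3  # about 457.0
gain_in_precision = unadjusted_error_var / adjusted_error_var

# F for methods on the unadjusted final scores (1 and 4 d.f.).
F_unadjusted = (ss_y_methods / 1) / unadjusted_error_var

# Correlation of the interaction effects for initial and final scores.
r_interaction = sp_mxs / math.sqrt(ss_x_mxs * ss_y_mxs)
```

The gain in precision is a little over 1.1, the unadjusted F about 2.72, and the correlation of interaction effects about .57.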
6. ANALYSIS OF COVARIANCE TO "HOLD CONSTANT" OTHER
VARIABLES THAN INITIAL STATUS

In the preceding examples, the methods of analysis of covari- 
ance were employed primarily in order to increase the precision 
of the experiment, and not because the relationship of the initial 
to the criterion test scores was of any interest in itself. In such 
situations, the test to use as the initial test is that which is likely 
to show the highest correlation with the criterion, regardless of its 
other qualities. Usually, the test most likely to satisfy this re- 
quirement is one that measures initial status in the same trait 
that is measured by the criterion test — often exactly the same 
test may be used to secure both the initial and final measures. 

The same methods of analysis may be employed with reference
to any concomitant variable, whether or not it may be measured at
the beginning of the experiment, in order to determine how the
criterion means would have differed if this concomitant variable
had been "held constant." For example, in an experiment concerned
with certain methods, it may be suspected that one of the
methods may motivate the pupils to spend more time in study
out-of-class than the others, and it may be desired to know which
method would have resulted in highest achievement had the pupils
spent the same total time in study under each method. Suppose
that a record was therefore kept during the experiment of the
amount of study time for each pupil, and that from this record the
total time for each pupil was determined. By the methods of
analysis of covariance, the differences in mean scores on the
criterion achievement test could then be "adjusted" so as to eliminate
the effect of time differences. This would be done in exactly the
same way as if the time measures represented scores on an initial
test. One could thus determine whether or not the mean differences
in achievement are significant both when time is "held
constant" and when it is not, as well as whether or not the time
differences themselves are significant. In other words, the
adjustment would indicate what differences in achievement would
have been found had all pupils spent the same amount of time
in study (assuming linearity of regression, as well as constancy of
regression from group to group). (If one also desired to allow
for chance differences in initial status, one could do so by the
methods suggested in the following section.)


7. STATISTICAL CONTROL OF MORE THAN ONE CONCOMITANT 
VARIABLE 

If it is desired to eliminate the effect of more than one uncon- 
trolled variable in an experiment of the type described in the pre- 
ceding section, this may be done by an extension of the methods 
just considered. In this case the adjustment will be made in 
terms of the multiple regression equation (see Chapter VII) be- 
tween the criterion and the uncontrolled variables. The regression 
coefficients can be computed as before from the error terms secured 
through analyses of the variances and covariances of the variables 
involved. 

In the case where allowance is to be made for two initial measures
(X and Z), the multiple regression equation will be

$$ y = b_x x + b_z z. $$

To compute these regression coefficients, an analysis of variance
must be carried through for each of the three variables and for the
three possible covariances. Having found the error term (sum of
squares or products) in each of these analyses, the results may be
substituted in the following simultaneous equations, which may
then be solved for $b_x$ and $b_z$.

$$ \sum xy = b_x \sum x^2 + b_z \sum xz $$
$$ \sum zy = b_x \sum xz + b_z \sum z^2 $$

The formula for computing any adjusted score ($Y_a$) will then be

$$ Y_a = Y - b_x x - b_z z. $$

The total sum of squares for the adjusted scores will be (see
page 190)

$$ \sum (y - b_x x - b_z z)^2 = \sum y^2 - 2 b_x \sum xy - 2 b_z \sum zy + b_x^2 \sum x^2 + 2 b_x b_z \sum xz + b_z^2 \sum z^2. $$

Each of the components of the total sum of squares of adjusted
scores may be computed by the same formula from the corresponding
components of the sums of squares and products for the three
variables involved. The error variance for adjusted scores will
then be computed as before, after having allowed for the two
degrees of freedom utilized in computing the regression coefficients.
The reduced variance for methods would then be computed in a
manner similar to that already described, and the test of
significance applied as before.
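
The solution of the two simultaneous equations, and the sum of squares of adjusted scores that results, may be sketched in Python as follows. The error-term sums below are hypothetical figures chosen for illustration (the text gives no numerical example for two initial measures), and the function names are assumptions. The equations are solved by determinants; any other method of solving two linear equations would serve equally well.

```python
def solve_two_covariates(ss_x, ss_z, sp_xz, sp_xy, sp_zy):
    """Solve  sum xy = bx*sum x^2 + bz*sum xz  and  sum zy = bx*sum xz + bz*sum z^2."""
    det = ss_x * ss_z - sp_xz ** 2
    if det == 0:
        raise ValueError("X and Z are perfectly correlated; no unique solution")
    b_x = (sp_xy * ss_z - sp_zy * sp_xz) / det
    b_z = (sp_zy * ss_x - sp_xy * sp_xz) / det
    return b_x, b_z


def adjusted_ss_two(ss_y, ss_x, ss_z, sp_xz, sp_xy, sp_zy, b_x, b_z):
    """Expanded form of sum (y - bx*x - bz*z)^2, valid for any bx and bz."""
    return (
        ss_y
        - 2 * b_x * sp_xy
        - 2 * b_z * sp_zy
        + b_x ** 2 * ss_x
        + 2 * b_x * b_z * sp_xz
        + b_z ** 2 * ss_z
    )


# Hypothetical error-term sums of squares and products.
ss_x, ss_z, ss_y = 1800.0, 900.0, 1250.0
sp_xz, sp_xy, sp_zy = 600.0, 1200.0, 500.0

b_x, b_z = solve_two_covariates(ss_x, ss_z, sp_xz, sp_xy, sp_zy)
adj_error_ss = adjusted_ss_two(ss_y, ss_x, ss_z, sp_xz, sp_xy, sp_zy, b_x, b_z)

# When the b's are derived from these same sums, the expansion reduces to
# sum y^2 - bx*sum xy - bz*sum zy.
short_form = ss_y - b_x * sp_xy - b_z * sp_zy
```

The adjusted error variance would then be `adj_error_ss` divided by the error d.f. less two.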

Similar methods could be employed to allow for still other 
initial measures, but obviously with a tremendous increase in the 
amount of labor involved. The computational task for two 
initial measures is not at all unmanageable, and may sometimes 
be worth while in educational experiments, considering the ease 
with which additional measures may be secured. The advantage 
gained depends upon the magnitude of the multiple correlation 
coefficient for the contemplated number of variables as compared 
to that for the best combination of a smaller number. Experience 
with educational tests has shown that in situations of this kind 
the multiple correlation of two initial measures with the criterion 
will seldom be very much higher than the higher of the two zero- 
order correlations, and that usually only a negligible increase in 
the multiple correlation is secured by adding a third independent
variable (assuming, of course, that the two already selected are the 
best two for the purpose). There would be very little point, there- 
fore, to a discussion of the more complex procedures required for 
three or more initial measures. 

For the case of two initial measures already considered, it may
be worth pointing out that the multiple correlation $R_{y \cdot xz}$ between
the initial measures and the criterion may be computed from the
formula

$$ R^2_{y \cdot xz} = \frac{b_x \sum xy + b_z \sum zy}{\sum y^2}, $$

using either the total sums of products and squares or those for
the error term secured from the analyses of variance and covariance,
dependent upon whether the correlation is desired without or with
the effects of schools and methods eliminated. How much the
labor of allowing for both variables is worth while is then dependent
upon how much $\sqrt{1 - R^2_{y \cdot xz}}$ is less than either
$\sqrt{1 - r^2_{xy}}$ or $\sqrt{1 - r^2_{zy}}$.
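
A brief Python sketch of this comparison follows. The sums are hypothetical and purely illustrative, as are the names; the regression coefficients are obtained from the two simultaneous equations by determinants.

```python
import math

# Hypothetical error-term sums of squares and products.
ss_x, ss_z, ss_y = 1800.0, 900.0, 1250.0
sp_xz, sp_xy, sp_zy = 600.0, 1200.0, 500.0

# Regression coefficients from the two simultaneous equations.
det = ss_x * ss_z - sp_xz ** 2
b_x = (sp_xy * ss_z - sp_zy * sp_xz) / det
b_z = (sp_zy * ss_x - sp_xy * sp_xz) / det

# Squared multiple correlation and the two squared zero-order correlations.
R2 = (b_x * sp_xy + b_z * sp_zy) / ss_y
r2_xy = sp_xy ** 2 / (ss_x * ss_y)
r2_zy = sp_zy ** 2 / (ss_z * ss_y)

# The quantities to be compared in judging whether the second measure pays.
with_both = math.sqrt(1 - R2)
with_x_only = math.sqrt(1 - r2_xy)
with_z_only = math.sqrt(1 - r2_zy)
```

In this made-up case the second measure lowers the quantity only from .60 to about .59, the kind of small gain the text describes as typical.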


CHAPTER VII 


MISCELLANEOUS PROBLEMS IN CORRELATION ANALYSIS 


I. INTRODUCTORY 
Tuts final chapter differs so much in character from those pre- 
ceding that some introductory explanation seems demanded. As 
was explained in the Preface, this book has been written primarily 
in order to make more readily available and comprehensible to the 
research worker in education those relatively recent developments 
in statistical theory and practice which, although of apparently 
high promise in educational research, have thus far received no 
mention (or have been only very inadequately discussed) in the 
standard and widely used texts in educational statistics. Thus far 
in this book it has been possible to present a fairly comprehensive 
discussion of each general problem presented, without any conse- 
quential duplication of the content of other texts in this field; that 
is, each chapter has been fairly complete in itself. To attempt any 
comprehensive discussion here of the general problem of correlation 
analysis in educational research, however, would involve a great
deal of duplication of what has already been well done elsewhere.
Most of the major aspects and applications of correlation analysis 
have been adequately treated in many available texts. Indeed, 
methods of correlation analysis appear to have been more widely 
used and more critically studied in education and psychology than 
in any other field of research. This is true in part because in these 
fields we are naturally interested in the organization, i.e., in the 
inter-relationships of mental functions, and because correlation 
techniques are so well adapted to the objective study of those rela- 
tionships. It is true also because the instruments available for 
measuring these traits are so much more fallible than the measuring 
instruments employed in most of the other fields of research. It is
consequently of greater importance in this field that we know how
fallible our measures are, or that we have objective estimates of the
errors of measurement involved. Again, the methods of correlation
analysis are peculiarly adapted to these purposes. 

In accordance with the afore-mentioned purpose, then, this chap- 
ter will (with a few exceptions) deal only with those aspects of cor- 
relation theory that do not seem to have received adequate atten- 
tion in current texts in this field. It may be noted in particular 
that, in spite of the frequency with which we have had occasion to 
use correlation methods in education, we appear to have given very
little consideration to some of the assumptions underlying their use,
and have consequently been guilty of many misapplications of
them. One of the purposes of this chapter is to draw the student's
attention to some of the more serious of these errors, and to point 
out the peculiar difficulties with which we are faced in correlation 
studies in educational research. Another purpose is to describe to 
the student certain special correlation techniques with which he 
may not have become acquainted in any introductory course in 
educational statistics. 

In general, then, this chapter is intended only to supplement the 
general discussions of correlation theory elsewhere available, and is 
not intended to constitute a comprehensive and well balanced
treatment of the subject. The content of this chapter is conse- 
quently miscellaneous and relatively lacking in unity, and the vari- 
ous sections may differ widely in their usefulness to students with 
highly specialized research interests. Unless this is clearly
understood, there is some danger that the student may secure a false
impression of the relative importance of some of the techniques
here discussed. Before reading this chapter, therefore, the student
is urged to review thoroughly the discussions of correlation methods
in two or three available standard texts in educational statistics *
and thus prepare himself better to see in their proper perspective
the specific problems which will here be considered.
* Recommended references are:
Statistics in Psychology and Education, H. E. Garrett, Longmans, Green and
Company, 1937, Chapters IX-
Statistical Methods for Students in Education, Karl J. Holzinger, Ginn and
Company, Chapters IX, X, XIV and XV.
A First Course in Statistics, E. F. Lindquist, Houghton Mifflin Company,
Chapters X and XI.


2. THE SIGNIFICANCE OF A PRODUCT-MOMENT CORRELATION
COEFFICIENT

The customary procedure in educational research for determining
whether or not an observed r is significant has been to compute
the standard error of r, using formula (33) below, and to describe
the coefficient as significant if it is more than 2.5 or 3 times the
standard error. The formula used for $\sigma_r$ is

$$\sigma_r = \frac{1 - r^2}{\sqrt{N}} \qquad (33)$$

in which the r in the right-hand term really represents the true r,
but for which the obtained r is substituted to secure an estimate
of $\sigma_r$.

There are two principal objections to this procedure. In the
first place, it is inconsistent to use the obtained coefficient ($r_o$) as
an estimate of the true r when testing the hypothesis that the true r
is zero. Under this hypothesis, our estimate of the standard deviation
of the sampling distribution of $r_o$ should be $1/\sqrt{N}$ rather than
$(1 - r_o^2)/\sqrt{N}$. To avoid this inconsistency, we should
describe an obtained correlation coefficient ($r_o$) as significant at
the 1 per cent level if it exceeds $2.576/\sqrt{N}$ rather than if it
exceeds $2.576(1 - r_o^2)/\sqrt{N}$. This, however, would assume that
the sampling distribution is normal, and it is here that the second
objection arises. When N is small (and the true r is zero) the sampling
distribution of $r_o$ differs slightly from the normal distribution.
Hence, even though we avoided the first objection, we could not use the
normal probability integral table to interpret the standard error
exactly.
The appropriate procedure * for determining whether or not an r
obtained from a small random sample is significant is to compute
the value of

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} \qquad (34)$$

in which the r is that obtained from the sample. This t may be
evaluated by means of Table 3, using N - 2 as the number of
degrees of freedom.

* R. A. Fisher, Statistical Methods for Research Workers, 6th ed., p. 196.

For any given size sample, one can compute the minimum value
of r that will be significant at any given level, by substituting in
(34) the value of N and the value of t needed at that level, and
solving for r. This has been done by the writer (at the 5 per cent
and 1 per cent levels) for a selected number of values of N, and the
results presented in Table 13. For example, in a sample of 60
cases r would have to exceed .254 to be significant at the 5 per cent
level, or would have to exceed .330 to be significant at the 1 per cent
level. Given this table, the student will have little occasion to use
formula (34). If the size of the sample in question lies between
two of the N's given in Table 13, it will be sufficiently accurate for
most practical purposes to use the value of r for the nearest N given,
or to interpolate linearly between the two nearest values given.
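
The solution of (34) for r can be sketched in a few lines of Python. Squaring (34) and rearranging gives r = t / sqrt(t^2 + N - 2). The function name is illustrative only, and the t values are assumed to be read by hand from a t table for N - 2 degrees of freedom (here 58); nothing in the sketch looks them up.

```python
import math

def min_significant_r(n, t_crit):
    """Smallest r that reaches significance in a sample of n cases.

    t_crit is the tabled two-sided t for n - 2 degrees of freedom
    (supplied by the user; it is not computed here).
    """
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Tabled t for 58 degrees of freedom (assumed values from a standard table).
r_05 = min_significant_r(60, 2.0017)   # 5 per cent level
r_01 = min_significant_r(60, 2.6633)   # 1 per cent level
print(round(r_05, 3), round(r_01, 3))
```

For a sample of 60 cases the two results agree, to three decimals, with the values .254 and .330 quoted above.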

It may be well to show, in terms of a specific illustration, how
misleading may be the procedure described at the beginning of this
section. Suppose that we have obtained an r of .54 from a random
sample of 16 cases. According to (33), the standard error of this r
is .177. The observed r is more than 3 times this value. Hence,
if we followed the procedure first described, we would conclude
that the observed r is significant well beyond the 1 per cent level.
From Table 13, however, we see that for a sample of 16 cases an r
must exceed .623 to be significant at this level.
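
The contrast between the two procedures may be checked numerically. The following is a minimal sketch, assuming only formulas (33) and (34) as given above; the function names are illustrative, and the critical t of 2.977 (14 degrees of freedom, 1 per cent level) is an assumed table value entered by hand.

```python
import math

def se_formula_33(r, n):
    """Formula (33): (1 - r^2) / sqrt(N)."""
    return (1 - r * r) / math.sqrt(n)

def t_formula_34(r, n):
    """Formula (34): t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.54, 16
ratio = r / se_formula_33(r, n)      # r expressed in standard-error units
t = t_formula_34(r, n)
T_CRIT_01 = 2.977                    # assumed tabled t, 14 df, 1 per cent level

print(round(ratio, 2), round(t, 2), t > T_CRIT_01)
```

The ratio exceeds 3, which is what makes the customary procedure appear so favorable, while the t of about 2.4 falls short of the tabled value needed at the 1 per cent level.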


TABLE 13
VALUES OF CORRELATION COEFFICIENT REQUIRED FOR SIGNIFICANCE
AT THE 5 PER CENT AND 1 PER CENT LEVELS FOR SAMPLES OF VARIOUS
SIZES (N)


3. THE RELIABILITY OF A PRODUCT-MOMENT CORRELATION
COEFFICIENT

The customary procedure for describing the reliability of a
product-moment coefficient of correlation has been to compute its
standard error by (33), and to interpret the standard error by
means of the normal probability integral table. This procedure is
not valid for high values of r even though the sample is large.
When the true correlation approaches ± 1.00, the sampling
distribution of the r's obtained even from large samples will be
markedly skewed, and any interpretations of the standard error based
on the normal probability integral table may be very seriously
misleading. However, Fisher has shown that for any value of r the
function

$$z = \tfrac{1}{2} \log_e \frac{1 + r}{1 - r} \qquad (35)$$

in which r is the observed correlation, is very nearly normally
distributed, and has shown that the standard error of z is

$$\sigma_z = \frac{1}{\sqrt{N - 3}} \qquad (36)$$

in which N is the number of cases in the sample.

This z-function may be used to test any exact hypothesis * concerning
the population value of r (given the r obtained from a random
sample), or may be used to test the significance of a difference
between two r's obtained from independent random samples.

* The z-function may be used quite satisfactorily to test the null hypothesis that
the true r is zero, but for such simple tests of significance the t-test of the preceding
section is to be preferred as a slightly more exact and conservative test.

In order that these uses of z may not require the student to deal
with logarithms in transforming r to z by means of (35), the writer
has prepared a table (Table 14) of values of z corresponding to
various values of r. The manner in which this table may be used
should be made clear by the following illustration.

Suppose that the r obtained from a random sample of 67 cases
is .91. Is the hypothesis reasonable that the true r is .95? To test
this hypothesis, we first find the value of z corresponding to the
observed and hypothetical values, and then find the difference
between these values of z. We then divide this difference by the
standard error of the observed z, thus expressing the difference as a
normal deviate which may be interpreted by means of the normal
probability integral table. In this case, the z corresponding to the
observed r of .91 is 1.528, while that corresponding to the
hypothetical r of .95 is 1.832. The difference between these z's is .305. The
standard error of z is $1/\sqrt{67 - 3}$ = .125; hence the normal deviate
equivalent of the difference is .305/.125 = 2.44. The probability
that a measure selected at random from a normal distribution will
deviate more than 2.44 $\sigma$ from the mean is less than 2 in 100. Hence,
we could not very reasonably retain the hypothesis that the true r
is as high as .95.

It will be instructive, for the same sample, to test the hypothesis
that the true r is .87, since this value is as far below the observed r
of .91 as the value first tested (.95) was above it. In this case, the
value of z for the hypothetical r is 1.333; hence the difference
between the observed and hypothetical z's is .195. The normal
deviate equivalent of the difference is then .195/.125 = 1.56,
which is not significant even at the 10 per cent level. Hence, we
see that while it is quite reasonable to suppose that the true r is .04
below the observed value, it is not reasonable to suppose that it is
an equal amount above. This illustration should give the student
some appreciation of the degree of skewness in the sampling
distribution of r for high correlations, and should indicate how
misleading is the practice of appending “probable errors” to r's of
large magnitude.
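
The two tests just described may be reproduced without Table 14 by computing z directly from (35). The sketch below is offered only as an illustration: the function names are hypothetical, and the two-sided probability is obtained from the complementary error function in place of the normal probability integral table. Because it carries unrounded z's, the first deviate comes out near 2.43 rather than the 2.44 obtained from the rounded table values.

```python
import math

def fisher_z(r):
    """Formula (35): z = 0.5 * log((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_test(r_obs, r_hyp, n):
    """Normal deviate and two-sided probability for an exact hypothesis about r."""
    se = 1 / math.sqrt(n - 3)                      # formula (36)
    deviate = (fisher_z(r_obs) - fisher_z(r_hyp)) / se
    p_two_sided = math.erfc(abs(deviate) / math.sqrt(2))
    return deviate, p_two_sided

dev_high, p_high = z_test(0.91, 0.95, 67)   # hypothesis: true r is .95
dev_low, p_low = z_test(0.91, 0.87, 67)     # hypothesis: true r is .87
print(round(dev_high, 2), round(p_high, 3))
print(round(dev_low, 2), round(p_low, 3))
```

The first probability falls below .02 and the second above .10, in agreement with the conclusions reached in the text.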

If desirable, this procedure may be extended to set up limiting
values of the tenable hypotheses concerning the value of the true r.
Suppose, for instance, that for the preceding example we wished to
know within what limits we could be confident at the 1 per cent
level that the true r lies. Since 99 per cent of the area under the
normal curve lies within 2.576 $\sigma$ of the mean, the normal deviate
equivalent of the difference between observed and hypothetical z's
must not exceed 2.576. In other words, in the limiting case

$$\pm 2.576 = \frac{z_o - z_h}{\sigma_z}$$

Hence, the limiting values of $z_h$ are
$z_o \pm 2.576\,\sigma_z = 1.528 \pm 2.576 \times .125$, or 1.206 and 1.850.

The values of r corresponding to these values of z are .835 and .955,
respectively. We may therefore be confident, at the 1 per cent
level, that the true r lies somewhere within these limits. Again
it is evident, from the fact that the lower limit deviates
considerably farther from the observed r of .91 than does the upper limit,
that the sampling distribution of r is markedly skewed for high
values of the true r.

Let us contrast the result just obtained with that which would
have been found had we followed the procedure described at the
beginning of this section. According to (33), the estimated
standard error of an r of .91 for a sample of 67 cases would be .021, from
which we would have concluded that the limiting values of the true
r (at the 1 per cent level) are .91 ± .021 × 2.576, or .856 and .964.

TABLE 14
VALUES OF z FOR VARIOUS VALUES OF r

  r       z          r       z          r       z          r       z          r       z

.000  .000000     .200  .202733     .400  .423648     .600   .693146     .800  1.098610
.005  .005000     .205  .207946     .405  .429615     .605   .700995     .805  1.112656
.010  .010000     .210  .213171     .410  .435610     .610   .708920     .810  1.127027
.015  .015001     .215  .218407     .415  .441635     .615   .716922     .815  1.141740
.020  .020003     .220  .223656     .420  .447691     .620   .725004     .820  1.156815
.025  .025005     .225  .228916     .425  .453778     .625   .733167     .825  1.172272
.030  .030009     .230  .234189     .430  .459896     .630   .741415     .830  1.188134
.035  .035014     .235  .239475     .435  .466046     .635   .749750     .835  1.204425
.040  .040021     .240  .244774     .440  .472230     .640   .758172     .840  1.221171
.045  .045030     .245  .250086     .445  .478447     .645   .766687     .845  1.238402
.050  .050042     .250  .255412     .450  .484699     .650   .775297     .850  1.256150
.055  .055056     .255  .260753     .455  .490987     .655   .784006     .855  1.274450
.060  .060072     .260  .266108     .460  .497311     .660   .792812     .860  1.293342
.065  .065092     .265  .271478     .465  .503671     .665   .801723     .865  1.312868
.070  .070115     .270  .276863     .470  .510069     .670   .810741     .870  1.333077
.075  .075141     .275  .282264     .475  .516506     .675   .819870     .875  1.354022
.080  .080171     .280  .287682     .480  .522983     .680   .829112     .880  1.375765
.085  .085205     .285  .293115     .485  .529501     .685   .838472     .885  1.398373
.090  .090244     .290  .298566     .490  .536059     .690   .847954     .890  1.421923
.095  .095287     .295  .304034     .495  .542660     .695   .857561     .895  1.446504
.100  .100335     .300  .309519     .500  .549305     .700   .867299     .900  1.472216
.105  .105388     .305  .315023     .505  .555994     .705   .877171     .905  1.499177
.110  .110447     .310  .320545     .510  .562728     .710   .887182     .910  1.527521
.115  .115511     .315  .326086     .515  .569510     .715   .897338     .915  1.557407
.120  .120581     .320  .331646     .520  .576339     .720   .907643     .920  1.589023
.125  .125657     .325  .337227     .525  .583216     .725   .918104     .925  1.622593
.130  .130740     .330  .342828     .530  .590144     .730   .928725     .930  1.658386
.135  .135829     .335  .348449     .535  .597123     .735   .939514     .935  1.696734
.140  .140925     .340  .354092     .540  .604154     .740   .950477     .940  1.738043
.145  .146029     .345  .359756     .545  .611240     .745   .961621     .945  1.782838
.150  .151140     .350  .365443     .550  .618380     .750   .972953     .950  1.831777
.155  .156259     .355  .371152     .555  .625577     .755   .984481     .955  1.885737
.160  .161386     .360  .376885     .560  .632832     .760   .996213     .960  1.94590
.165  .166522     .365  .382642     .565  .640146     .765  1.008158     .965  2.013945
.170  .171667     .370  .388423     .570  .647523     .770  1.020328     .970  2.092296
.175  .176820     .375  .394228     .575  .654959     .775  1.032725     .975  2.184724
.180  .181982     .380  .400059     .580  .662461     .780  1.045371     .980  2.297555
.185  .187155     .385  .405916     .585  .670029     .785  1.058265     .985  2.442657
.190  .192337     .390  .411799     .590  .677665     .790  1.071429     .990  2.646647
.195  .197529     .395  .417711     .595  .685370     .795  1.084873     .995  2.994474


4. THE SIGNIFICANCE OF A DIFFERENCE BETWEEN r's OBTAINED
FROM INDEPENDENT RANDOM SAMPLES

The test of significance of a difference between two observed r's
is similar in character to that just described, but requires that we
find the standard error of the difference between the z's
corresponding to the observed r's. Suppose that the r between two variables
for one sample of 35 cases is .82, and for another sample of 42 cases
is .89. The corresponding values of z are 1.1568 and 1.4219,
respectively. The difference between these z's is .2651. The
standard error of the first z is $1/\sqrt{32}$, and of the second is $1/\sqrt{39}$.
Hence, the standard error of the difference is

$$\sqrt{\frac{1}{32} + \frac{1}{39}} = .238$$

Clearly the difference is not significant, being barely larger than
its standard error. Since such differences are approximately
normally distributed, the observed difference would have to be
approximately 2.576 times its standard error to be significant at the
1 per cent level, or about 1.960 times its standard error to be
significant at the 5 per cent level.
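
The same test may be written out as a short computation. The sketch below assumes only formulas (35) and (36); the function name is illustrative and not taken from the text.

```python
import math

def difference_between_rs(r1, n1, r2, n2):
    """Difference in z's, its standard error, and their ratio,
    for two r's from independent random samples."""
    z_diff = math.atanh(r2) - math.atanh(r1)            # formula (35) applied to each r
    se_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # from formula (36)
    return z_diff, se_diff, z_diff / se_diff

z_diff, se_diff, ratio = difference_between_rs(0.82, 35, 0.89, 42)
print(round(z_diff, 4), round(se_diff, 3), round(ratio, 2))
```

The ratio is only a little above 1, well short of the 1.960 required at the 5 per cent level.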

It should be noted that small differences in observed correlation
coefficients will not be significant unless the samples employed are
quite large. If, for example, the samples involved each contained
100 cases, the standard error of the difference in z's will be

$$\sqrt{\frac{1}{97} + \frac{1}{97}} = .14$$

Hence, the difference in z's must be at least .14 × 1.96 = .2744 to be
significant at the 5 per cent level, or at least .14 × 2.576 = .3606 to
be significant at the 1 per cent level. It may be shown,
similarly, that for samples of 100 cases each, differences between
r's of .80 and .90, or .70 and .84, or .50 and .72, etc., would just fail
to be significant at the 1 per cent level.

If, then, one is planning an experiment or investigation the
purpose of which is to establish the significance of a small
difference between two correlations, very large samples may be
required. Suppose, for example, that the observed r's are .65 and
.70. The values of z corresponding to these r's are .7753 and .8673
respectively, with a difference of .0920. For this difference to be
significant at the 1 per cent level, the standard error of the
difference in z's must then not exceed .0920/2.576 = .0357. Hence,
solving for N in

$$.0357 = \sqrt{\frac{1}{N - 3} + \frac{1}{N - 3}}$$

we get N = 1570, approximately. Unless the samples contained
over 1500 cases each, then, one could not (at the 1 per cent level)
regard observed r's of .65 and .70 as surely indicative of a real
difference between the populations sampled. If the observed r's are
high, the size of sample needed to render an observed difference of
.05 significant at any given level will of course be less. For
instance, a sample of 284 would be sufficient to render the difference
between observed r's of .85 and .90 significant at the 1 per cent
level, and the same difference would be significant at the 5 per cent
level for a sample of 166.
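
The size of sample required for a given pair of r's may be estimated by the same reasoning. The sketch below assumes two samples of equal size N, so that the standard error of the difference in z's is sqrt(2 / (N - 3)); the function name is illustrative. Since it works from unrounded z's and keeps the 3 in N - 3, its answers need not agree to the last case with the round figures quoted in the text.

```python
import math

def cases_needed(r1, r2, normal_deviate):
    """Approximate N per sample for the difference between r1 and r2
    to reach the given normal deviate (two equal independent samples)."""
    z_diff = abs(math.atanh(r1) - math.atanh(r2))
    max_se = z_diff / normal_deviate        # largest tolerable SE of the difference
    return math.ceil(2 / max_se ** 2 + 3)   # solve max_se = sqrt(2 / (N - 3))

n_low = cases_needed(0.65, 0.70, 2.576)     # 1 per cent level
n_high = cases_needed(0.85, 0.90, 2.576)    # 1 per cent level
n_high_05 = cases_needed(0.85, 0.90, 1.960) # 5 per cent level
print(n_low, n_high, n_high_05)
```

The comparison of the first two results shows how much smaller a sample suffices when the correlations themselves are high.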

It should be apparent, from the foregoing, that correlation
studies in general require relatively large samples, and that many
of the correlation studies which have been reported in the literature
on educational research were doomed to inconclusiveness before
they were conducted.

The student should note carefully that the procedure which has
been described for determining the significance of a difference
between correlation coefficients is valid only if the coefficients are
obtained from independent random samples. If both are obtained
from the same sample, and if there is any correlation between
either of the variables involved in the first correlation and either
involved in the second, the test just described would be invalid. In
this case the coefficients obtained from a large number of similar
samples would themselves show a positive correlation, and this
relationship would result in a reduced standard error of the
difference. Suppose, for example, that in a study of the relative validity
of two tests, A and B, in relation to a criterion C, $r_{AC}$ and $r_{BC}$ were
computed for the same sample. The difference in these “validity
coefficients” could not be evaluated in the manner described in the
preceding paragraphs; or, if this test were applied, the estimated
standard error of the difference would be considerably larger than
the true standard error. Similarly, the difference in the reliability
coefficients of two tests, if computed for the same sample, might
be really significant but fail to appear so by this test.

Until very recently, no satisfactory test of the significance of
a difference between correlated coefficients was available. Since
the first printing of this text appeared, however, the author has
learned through correspondence with W. G. Cochran that both he
and H. Hotelling have independently reached the same solution
to this problem. According to Cochran, the difference between
$r_{12}$ and $r_{13}$ for a random sample of n cases may be tested by
computing

$$t = \frac{(r_{12} - r_{13})\sqrt{(n - 3)(1 + r_{23})}}{\sqrt{2}\,\sqrt{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}}}$$

t having (n - 3) degrees of freedom. In this expression the
coefficients $r_{12}$ and $r_{13}$ are given the same sign before taking the
difference, the test being merely one of their absolute magnitude and
not involving signs. For example, if the scores on tests A, B, and
C for a random sample of 147 cases show correlations of $r_{AB}$ = .68,
$r_{AC}$ = .77, and $r_{BC}$ = .84, then the t for $r_{AB} - r_{AC}$ is equal to 3.0,
which is significant well beyond the 1 per cent level. Had the test
for independent samples been incorrectly applied, this difference
would have appeared to fall short of the 10 per cent level of
significance.
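
The expression for t may be evaluated as follows. This is a sketch only: the function name is hypothetical, and it assumes the three coefficients are entered with the signs already arranged as the text requires, so that r12 and r13 share the variable numbered 1.

```python
import math

def t_correlated_rs(r12, r13, r23, n):
    """t for the difference between r12 and r13 computed on the same
    sample of n cases; evaluate with n - 3 degrees of freedom."""
    numerator = (r12 - r13) * math.sqrt((n - 3) * (1 + r23))
    determinant = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    return numerator / math.sqrt(2 * determinant)

# Tests A, B, C: r_AB = .68, r_AC = .77, r_BC = .84, 147 cases.
t = t_correlated_rs(0.68, 0.77, 0.84, 147)
print(round(abs(t), 1))

# For comparison, the (invalid) independent-samples ratio on the z scale.
wrong_ratio = (math.atanh(0.77) - math.atanh(0.68)) / math.sqrt(2 / (147 - 3))
print(round(wrong_ratio, 2))
```

The sign of t is negative here only because $r_{AB}$ is the smaller coefficient; its magnitude is what is referred to the t table. The second figure falls below 1.645, the normal deviate for the 10 per cent level.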

It may be well to emphasize again that the tests of significance 
described in this section are valid only for random samples. If the 
sample is a stratified sample, or if it consists of several intact school 
groups which show systematic differences in the correlated vari- 
ables, or which show systematic differences in the correlation be- 
tween these variables, the procedures which have been described 
in this section will not yield valid estimates of error. This 
problem will be considered in Section 6, following. 


5. COMBINING CORRELATIONS FROM SEVERAL SAMPLES

The z-transformation described in the preceding section may be
used to estimate the correlation for a population if the observed
correlations are known for several samples independently drawn at
random from that population (or from equally correlated
populations). The procedure is very simple, involving only the
computation of the weighted average of the z's corresponding to the
various r's, each z being weighted by (N - 3). The estimate required
is then the value of r corresponding to this average z.

Suppose, for example, that for samples of 80, 126, and 92 cases,
the values of the r between two given variables are .640, .720, and
.595 respectively. What is the best estimate of the r for the
population from which these samples were drawn? The corresponding
values of z are .758172, .907643, and .685370 respectively. The
weighted average of these z's is

[(.758172)(77) + (.907643)(123) + (.685370)(89)] ÷ 289 = .799368.

(The standard error of this z is $1/\sqrt{289}$ or .0588.) The r
corresponding to this weighted average is .665, which is the estimate
desired.
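
The weighted average may be computed directly as sketched below. The function name and the representation of each sample as an (r, N) pair are illustrative choices. The sketch transforms the average z back to r exactly, which gives a value near .664; the .665 above is the nearest entry in Table 14.

```python
import math

def combine_correlations(samples):
    """samples: list of (r, n) pairs from independent random samples.
    Returns the weighted mean z, its standard error, and the combined r."""
    weights = [n - 3 for _, n in samples]
    total_weight = sum(weights)
    mean_z = sum(w * math.atanh(r) for (r, _), w in zip(samples, weights)) / total_weight
    se_mean_z = 1 / math.sqrt(total_weight)
    return mean_z, se_mean_z, math.tanh(mean_z)

mean_z, se_mean_z, r_combined = combine_correlations(
    [(0.640, 80), (0.720, 126), (0.595, 92)]
)
print(round(mean_z, 4), round(se_mean_z, 4), round(r_combined, 3))
```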
It should be emphasized that this method of combining r's should
not be employed unless the samples involved are known to be
independent random samples from the same population (or from equally
correlated populations).

* See “Combinative Properties of Correlation Coefficients,” Jack W. Dunlap,
Journal of Experimental Education, March, 1937 (Volume 5, page 286), for a general
method of combining correlations from different groups.


6. THE EFFECT OF SCHOOL DIFFERENCES UPON CORRELATION
COEFFICIENTS OBTAINED FROM HETEROGENEOUS SAMPLES
CONSISTING OF INTACT SCHOOL GROUPS

The tests of significance described in the preceding sections of
this chapter are all designed for simple random samples. However,
as we have noted repeatedly throughout this text, samples
consisting of intact school groups may not be considered as random
samples of pupils. Since most of the samples used in correlation
studies in educational research do consist of intact school groups,
it will be well to consider carefully the effect of school differences
upon the correlation coefficients obtained from such samples.

Let us consider first the effect of the larger-than-chance
differences in school means which characterize the results obtained from
objective tests of school achievement. Suppose, for example, that
we wish to determine the correlation between scores on a certain
arithmetic test and a certain reading test for seventh-grade pupils
in Iowa schools. Suppose we decide to use a sample of 300 pupils,
and find that this number of cases may be most conveniently
secured by using all seventh-grade pupils in two school systems,
each of which can provide 150 pupils. Now let us suppose that the
pupils in School A are doing outstandingly good work in arithmetic
and unusually poor work in reading, and that the reverse is true for
the pupils in School B. If the scores for all 300 pupils were then
plotted on the same scattergram, the tally marks might be
distributed as in diagram I below. The tally marks for the pupils
in School A would lie at the lower (left) end of the reading scale,
and at the upper end of the arithmetic scale, or in the oval labeled
A. The long narrow shape and orientation of the oval indicates
that the correlation between reading and arithmetic scores is
positive and high. The B oval in diagram I may be similarly
interpreted. If now we were to compute the correlation coefficient for
the entire scattergram, we would find a marked negative total
correlation between arithmetic and reading scores for the two
schools, even though the correlation within each school is positive
and high.


I п 
Reading 


ZAL 


Diagram I of course represents a deliberate exaggeration, for the
sake of emphasis, over what one would be at all likely to find in an
actual situation of the type suggested. Only very rarely would the
total correlation for two or more schools be negative at the same
time that the correlations within schools are positive and high.
It frequently happens, however, that the total correlation is
markedly lower than within any individual school, as is suggested by
diagram II. More frequently, the total correlation would be
higher than within any individual school, as is suggested by
diagram III, and sometimes, but not often, the total correlation
might be very nearly the same as within the individual schools, as
could be true of diagram IV.

The manner in which this problem may be handled is directly
suggested by the phraseology of the preceding discussion. What
we should compute, in situations of this kind, is the correlation
within schools, rather than the total correlation, and the manner in
which this may be done has already been described in Chapter VI.
Our real interest, in the example cited, would be in the correlation
between arithmetic and reading scores for pupils all of whom have
had the same instruction in these subjects, and this is essentially
what the correlation within schools, as computed by the methods
of analysis of covariance, represents. The computational procedure
involves analyzing the sums of squares of the arithmetic and
reading scores into their between schools and within schools
components, computing the sum of products for within schools, and
then computing the correlation within schools from the sums of
squares and products within schools, using formula (24) on page
186. This correlation could then be interpreted as the average of
the correlations that would be found in the separate schools, or,
with reference to diagram I, for instance, as that which would be
found if the “A” and “B” ovals had been moved together, or
superimposed, so that the school means coincided for each test.

The correlation within schools, then, is unaffected by differences in
school means, and hence may be considered as equivalent to that
which would have been found had all pupils been taken from a
single school. Furthermore, if one may assume that in the
population of schools involved the correlation within any one school is
the same (except for chance) as in any other (the assumption of
homogeneous correlation), a within schools correlation computed for
a number of randomly selected schools may be treated as if it had
been secured from a simple random sample of N - m + 1 cases (N
represents the total number of pupils and m represents the number of
schools). For instance, the within schools correlation for a sample
of 420 cases in 11 schools would be treated as a correlation
coefficient obtained from a simple random sample of 410 cases, and any
of the techniques described in the preceding sections could then be
applied to this correlation. This would not be true, however, if
the assumption of homogeneous correlation within schools is not
satisfied, and this possibility will be more adequately considered
later.
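
The distinction between the total correlation and the correlation within schools may be illustrated with a small computation. The scores below are invented solely for illustration, in the spirit of diagram I, and the function names are hypothetical. The sketch assumes that the within schools correlation is the within schools sum of products divided by the square root of the product of the within schools sums of squares, which is taken here to be what formula (24) expresses.

```python
import math

def sums_about_mean(xs, ys):
    """Return (sum x^2, sum y^2, sum xy) in deviations from the group means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def correlation(sxx, syy, sxy):
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (reading, arithmetic) scores for two intact school groups.
schools = {
    "A": ([1, 2, 3, 4], [11, 12, 13, 15]),      # poor reading, good arithmetic
    "B": ([11, 12, 13, 14], [1, 2, 4, 4]),      # good reading, poor arithmetic
}

# Within schools: add the sums obtained separately inside each school.
within = [sum(part) for part in zip(*(sums_about_mean(x, y) for x, y in schools.values()))]
r_within = correlation(*within)

# Total: pool all pupils and ignore school membership.
all_x = [x for xs, _ in schools.values() for x in xs]
all_y = [y for _, ys in schools.values() for y in ys]
r_total = correlation(*sums_about_mean(all_x, all_y))

# Number of cases to use when testing r_within: N - m + 1.
effective_n = len(all_x) - len(schools) + 1
print(round(r_within, 2), round(r_total, 2), effective_n)
```

With these invented scores the within schools correlation is high and positive while the total correlation is strongly negative, and the within schools coefficient would be tested as if it came from 7 cases.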

This effect of school differences in an actual instance may be
illustrated by the data used in the example of Section 5 of Chapter
VI (pages 196 to 203). These are actual data secured from five
Iowa schools selected at random from over 250 available schools.
We noted in this example (page 203) that the correlation within
schools (between initial and final scores) was .66. The correlation
for the total sample may be computed from the sums of squares
and products for total, as follows:

$$r \text{ (total)} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \text{ (for total)} = \frac{42327}{\sqrt{69495.0 \times 49999.1}} = .72$$

This correlation was reduced by the methods differences, as may
be demonstrated by computing the correlation with methods
differences (but not school differences) eliminated. The value of
$\sum x^2$ for the total sample was 69495, and for methods alone was
1319.5, hence the value of $\sum x^2$ for within methods is 69495 - 1319.5
= 68175.5. Similarly, the $\sum xy$ term for within methods is
42327 - (-1352.6) = 43679.6, and the $\sum y^2$ term for within
methods is 49999.1 - 1382.5 = 48616.6. Accordingly, the correlation
with methods differences (only) eliminated is

$$r = \frac{43679.6}{\sqrt{68175.5 \times 48616.6}} = .76$$

The effect of the school differences alone, then, was to increase the
correlation from .66 to .76 for this sample of five schools.
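
The arithmetic of this example may be verified from the sums of squares and products quoted in the text. The sketch below uses only those figures; the dictionary keys and the function name are illustrative.

```python
import math

def r_from_sums(sxy, sxx, syy):
    """Correlation from a sum of products and two sums of squares."""
    return sxy / math.sqrt(sxx * syy)

# Figures quoted in the text (x = initial scores, y = final scores).
total = {"xx": 69495.0, "xy": 42327.0, "yy": 49999.1}
methods = {"xx": 1319.5, "xy": -1352.6, "yy": 1382.5}

# Within methods = total minus the methods component, term by term.
within_methods = {key: total[key] - methods[key] for key in total}

r_total = r_from_sums(total["xy"], total["xx"], total["yy"])
r_within_methods = r_from_sums(
    within_methods["xy"], within_methods["xx"], within_methods["yy"]
)
print(round(r_total, 2), round(r_within_methods, 2))
```

The two coefficients come to about .72 and .76, so that removing the methods differences raises the correlation slightly, while the further removal of school differences lowers it to the .66 found within schools.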

The effect of school differences in central tendency upon correla- 
tion coefficients computed from samples consisting of intact school 
groups is of particular concern in the evaluation of achievement 
tests intended for widespread use. When reliability coefficients or 
validity coefficients are computed for an objective test, they are 
usually obtained from heterogeneous samples containing pupils
from several schools. The test user is of course primarily
interested in the reliability or validity of the test for use within his
own school, and therefore, aside from any other considerations, the
reliability or validity coefficients within schools are those which
logically should be used in describing the tests. Furthermore,
within schools correlations computed for samples of this type are
likely to be considerably more stable than the total correlations
for the same samples. The writer would therefore strongly
recommend the method of analysis of covariance as a standard procedure
in computing the reliability and validity coefficients of
standardized tests, in order that the more stable and more meaningful
within schools correlations may be reported to the test user.

We have already noted that when a within schools correlation is
computed for a sample of intact school groups, the reliability or
significance of this correlation may be validly determined by the
procedures of Sections 2 to 4 of this chapter only if the correlation
within individual schools is fundamentally constant for the
population of schools involved. The possibility that there may be real
differences in correlation from school to school is therefore one to
which we should give very careful consideration. If the true
correlation between two variables differs markedly from school to
school, then the within schools correlation computed for a number
of schools becomes an “average” of dissimilar correlations, and will
therefore be both relatively unstable and difficult to interpret, if,
indeed, it may be considered as a valid “average” at all.

This possibility that correlations between scores on the same
tests may differ significantly from school to school was shown in a
study, by G. V. Lannholm, of the relative reliability of several
different types of tests of punctuation and capitalization ability for 
junior high school pupils.* Among other things, Lannholm attempted
to determine the relative validity of certain self-administering
objective tests of capitalization by computing the correlation of
scores on each test with the scores on a criterion test of
the dictation type. Six different types of capitalization tests, 
designated as types A to F, were evaluated. Each of these tests, 
including the criterion test, was based on exactly the same con- 
tent, and differed from the others only in form. The whole study 
consisted of a number of independent experiments, in each of
which the criterion test and two of the self-administering tests were 
administered to the pupils in several schools, different schools being 
involved in each experiment. Test A, which consisted of uncap- 
italized printed sentences in which the pupil was to indicate the 
places where capitals were needed, was paired with one of the other 
tests in each of five experiments, so that five independent observa- 
tions were made of the correlation of test A with the criterion, or of 
the “validity” coefficient of test A. In all of these five experi- 
ments, both test A and the criterion were administered under the 
same very carefully controlled conditions. The correlation computed
for each experiment was the within schools correlation, so
that the effect of differences in school means was eliminated. The 
results of these five experiments were as follows: 


Experiment   Number of   Number of   Within Schools Correlation     Within Schools
Number       Pupils      Schools     of Test A and Criterion Test   Reliability of Test A

                                     Mean = .58


* “The Measurement of Punctuation and Capitalization Ability,” G. V. Lannholm,
Journal of Experimental Education, September, 1939, vol. 8, no. 1, pp. 55-86.
It may be readily shown that the variation in these correlations
is much greater than can be attributed to chance fluctuations in
random sampling. In other words, the evidence is very convincing
that the true “validity” of test A, as judged by the criterion
employed, is considerably higher in some schools than in others.
Larger-than-chance variations were also found in the reliability
coefficients for test A, independently computed from the within
schools correlation between equivalent forms of the test in each
experiment, and these are given in the last column of the preceding
table. Admitting that the procedure was questionable, Lannholm
“averaged” these validity coefficients for the five experiments by
the method of Section 5 of this chapter (see last line of table above),
but wisely refrained from using the procedure of Section 3 to
estimate the precision of this average correlation. Very obviously,
this average coefficient is less stable and less meaningful than if it
had been derived from individual coefficients that differed only as
much as would coefficients obtained from random samples of the
same number of cases.

In order to secure a more dependable and generalized description
of the degree to which heterogeneity of correlation between scores
on educational tests characterizes populations made up of intact
school groups, the writer and J. H. Lyford analyzed certain data
secured from the 1939 Iowa Every-Pupil Testing Program for
Grades 6, 7, and 8. The data used were the scores made by
seventh-grade pupils on the 1939 Iowa Every-Pupil Tests of Basic Skills.
Test A of this battery is a 67-minute objective test of silent reading
comprehension. Part V of Test A is a 10-minute test of general
vocabulary. Test B is a 78-minute test of work-study skills,
including tests of ability to read maps, to read graphs and tables, to
use an index, to use a dictionary, etc. Test C is a 70-minute test of
basic language skills, including capitalization, punctuation, usage,
and spelling. Test D is an 80-minute comprehensive test of
achievement in arithmetic. All tests were administered under the
same very carefully controlled conditions in all schools
participating in the program. From each of 71 schools, 30 pupils were
selected at random from all seventh-grade pupils in the school, and
for each of these groups separately various inter-correlations were 
computed for total scores on the tests just described. The
distributions of these correlation coefficients are given in Table 15.
The heading of each column of frequencies identifies the tests
involved. For example, the first distribution is that of $r_{AD}$, or of the
correlations between total scores on the reading and arithmetic
tests.

For each coefficient obtained the corresponding z was determined
from Table 14 (page 215). The distribution of these z's for
each distribution of r's is given in the lower part of Table 15. Had
these z's been obtained from independent random samples of
30 cases each all drawn from the same population, the standard
deviation of any of the z-distributions should approximate

\[ \frac{1}{\sqrt{N - 3}} = \frac{1}{\sqrt{27}} = .192. \]

The actual standard deviations, $\sigma_z$, are given at the bottom of
Table 15. It will be noted that in each
case the actual standard deviation is larger than that expected
on the hypothesis of random sampling. For example, the standard
deviation of the z's corresponding to $r_{BC}$ is .285, or nearly 50
per cent larger than expected. These discrepancies are larger than
could be attributed to chance (as may be shown by applying
exact tests).
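
The expected value quoted above can be checked with a short sketch. It assumes that the z of Table 14 is Fisher's transformation, computed here with `math.atanh`; the function names are illustrative and are not taken from the text.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient r."""
    return math.atanh(r)

def expected_sd_of_z(n):
    """Standard deviation of z expected for random samples of n cases."""
    return 1.0 / math.sqrt(n - 3)

expected = expected_sd_of_z(30)   # samples of 30 pupils per school
observed_bc = 0.285               # sigma_z for r_BC as reported in the text

print(round(expected, 3))                # 0.192
print(round(observed_bc / expected, 2))  # 1.48
print(round(fisher_z(0.50), 3))          # 0.549
```

The ratio printed in the second line is the factor by which the observed spread of the z's exceeds the spread expected under random sampling from a single population.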

If these studies are at all representative of the conditions that 
generally prevail in correlation studies in educational research, the 
problems of designing such studies and of interpreting the results
are far more complex than we have heretofore supposed them to 
be. If, for example, the true reliability of a test fluctuates widely 
from school to school, then, in order to secure a meaningful and 
dependable generalized description of the test’s reliability, we must 
determine its reliability coefficient separately for each school in a
fairly large and representative sample of schools, and present the
actual distribution of these reliability coefficients. The average
or median of these reliability coefficients might then be taken as a
generalized description of the reliability of the test, and the relia- 


TABLE 15

DISTRIBUTIONS OF OBTAINED r'S AND z'S (EACH COMPUTED FOR A RANDOM
SAMPLE OF 30 SEVENTH-GRADE PUPILS IN A SINGLE SCHOOL) FOR SCORES
ON CERTAIN TESTS IN THE "1939 IOWA EVERY-PUPIL TESTS OF BASIC
SKILLS" IN 71 IOWA SCHOOLS

                                     Frequencies
Values of r              For Tests A     For Tests B     For Tests A-V
(Lower Limits            and D (r_AD)    and C (r_BC)    and C (r_A-V-C)
of Intervals)


2 

6 I2 I 
8 16 5 
7 7 9 
‚бо її 5 6 
55 15 7 II 
.50 6 I 9 
45 5 I 4 
.40 2 I A 
35 3 I Е 

.30 2 I 
2 3 
ї 
I 
I 
I 


Values of z
(Lower Limits            z_AD            z_BC
of Intervals)

                         σ_z = .226      σ_z = .285

Expected σ_z = .192
bility of this description would depend on the variance and number
of the observed correlations for individual schools. Clearly, under
these conditions, a reliability coefficient computed for only a few
schools, regardless of the number of pupils involved, would be
relatively unstable, even though school differences in means have
been eliminated by computing the within schools correlation. A
total coefficient of reliability computed for an entire sample
involving only a few schools can obviously be only less meaningful and
dependable, and it must be remembered that it is only this type of
coefficient which has thus far been provided with most current
standardized tests. It is to be hoped, therefore, that further
investigation will reveal a less pronounced heterogeneity of within
schools intercorrelations and self-correlations for educational
achievement tests in general than is suggested by Lannholm’s and
Lyford’s data.


7. THE ESTIMATED VALIDITY OF A TEST n TIMES AS LONG AS
A GIVEN TEST

In experimental comparisons of objective testing techniques,
the standard procedure has been to construct several forms of a
test using the same content but employing a different technique
with each form, to administer all forms together with a criterion
test to the same group or to equated groups of pupils, and then to
evaluate the forms in terms of their correlations with the criterion
and their reliabilities. Suppose, for example, that we wish to
compare the right-wrong type of spelling test, consisting of a printed
list of words in which the pupil is to indicate which words are
spelled correctly and which incorrectly, with the multiple-choice
type in which each word is presented in, say, four spellings, and in
which the pupil is to indicate which spelling is correct.

We might then prepare a list-dictation test (in which the words
are dictated to the pupil) for use as a criterion (assuming a priori
that this form is more valid than either of the experimental forms),
and then build parallel forms of the right-wrong and multiple-choice
types, each form being based on the same words as the criterion
test, and then administer all three tests to the same group
of pupils. Suppose that experience has shown that pupils require
more time per item for the multiple-choice type than for the
right-wrong, and that in the experiment the two forms are therefore
administered under different time limits, arbitrarily determined.
We might then find that the multiple-choice test, which required
more time, yielded the higher correlation with the criterion.
However, it is possible that had more words been included in the
right-wrong form and the time increased proportionately, so that the
administration time for both tests were the same, the right-wrong
test might have shown the higher correlation with the criterion.
If so, the right-wrong would be the better testing technique for the
given amount of testing time, since the practical question is,
“Given a certain amount of time, with which technique can the
most valid score be obtained?” This, for instance, would be the
problem faced by a constructor of a comprehensive achievement
test who has a certain number of minutes to devote to the testing
of achievement in spelling.
This being the case, we might wish to estimate, from the known
validity and reliability of the original right-wrong test, how valid
it would have been had it contained a sufficient number of words
(of homogeneous quality and difficulty and administered at the
same rate) to make the total administration time the same as for
the multiple-choice test. This could be done, according to
Kelley,* by means of the formula

\[ r_{lc} = \frac{r_{sc}}{\sqrt{\dfrac{1 - r_{ss}}{n} + r_{ss}}} \qquad (37) \]

in which $r_{sc}$ is the correlation of the short form with the criterion,
$r_{ss}$ is the reliability coefficient of the short form, $n$ is the ratio of the
length of the long form to that of the short form, and $r_{lc}$ is the
estimated correlation of the long form with the criterion. (If desired, the
Spearman-Brown Prophecy formula could be similarly used to
estimate the reliability of the long form.)

* Statistical Method, Truman L. Kelley, p. 200.
Suppose, for example, that our original right-wrong test,
administered in 5 minutes, showed a correlation of .70 with the criterion
and a reliability of .72, and that the multiple-choice test,
administered in 18 minutes, showed a correlation of .75 with the criterion,
and a reliability of .95. According to (37), then, a right-wrong test
3.6 times as long, and therefore administered in 18 minutes, would
show a correlation with the criterion of

\[ r_{lc} = \frac{.70}{\sqrt{\dfrac{1 - .72}{3.6} + .72}} = .784. \]

According to this estimate, a right-wrong test 18 minutes in length
(administered at the same rate in words per minute as the
experimental test) is more valid than a multiple-choice test 18 minutes in
length (also homogeneous with and administered at the same rate
as the experimental test). It is significant, however, that if we
estimate the validity coefficient of a 5-minute multiple-choice test
by the same formula, we get a value of .706, as compared to a
validity coefficient of .70 for a right-wrong test of the same length. In
other words, which of these two forms is superior depends upon the
length of test we wish to build.

(Note: The data used in the preceding example are fictitious, and
are in no sense indicative of the relative merits of the right-wrong and
multiple-choice types of spelling test.)


Formula (37) is appropriate if there is available only one form
of the test, and if $r_{ss}$ is estimated from the correlation between
scores on chance halves by means of the Spearman-Brown formula.
If two equivalent forms of the short test have been
administered, it would be better to define s in (37) as the sum of the scores
on the short forms, in which case $r_{ss}$ would be the reliability of this
total score estimated by the Spearman-Brown formula from the
correlation between scores on the short forms. In this case, of
course, the $n$ in (37) would represent the number of times the long
form (whose correlation with the criterion is being estimated) is
longer than the combined short forms.
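
The two estimates in the fictitious spelling example can be reproduced with a minimal sketch of formula (37). The function name `lengthened_validity` is an illustrative choice, not taken from the text.

```python
import math

def lengthened_validity(r_sc, r_ss, n):
    """Formula (37): estimated criterion correlation of a test n times as long.

    r_sc: correlation of the original form with the criterion
    r_ss: reliability coefficient of the original form
    n:    new length divided by original length (n < 1 shortens the test)
    """
    return r_sc / math.sqrt((1.0 - r_ss) / n + r_ss)

# Right-wrong test lengthened from 5 to 18 minutes (n = 3.6).
right_wrong_18 = lengthened_validity(0.70, 0.72, 18 / 5)

# Multiple-choice test shortened from 18 to 5 minutes (n = 5/18).
multiple_choice_5 = lengthened_validity(0.75, 0.95, 5 / 18)

print(round(right_wrong_18, 3))     # 0.784
print(round(multiple_choice_5, 3))  # 0.706
```

The same function covers both directions, since shortening a test is simply the case in which n is less than one.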
Formula (37) is important, not so much because of its possible
applications in research, as because its careful study should add 
considerably to the student's insight into the problem of test eval- 
uation. So far as the technique of testing involved is concerned, 
the important characteristic of the technique is the validity which 
it will yield per unit of time, but the validity per unit of time de- 
pends, among many other things, upon the length of the test and 
the relation of the reliability to the validity for any given length. If 
a test has a very high reliability and a low validity for a given 
length, then it is of relatively little avail to attempt to lengthen 
the test to secure a higher validity (assuming the material added 
is homogeneous with the original), but neither will the validity be 
lowered seriously if the test is shortened in length. On the other 
hand, if the reliability coefficient is only slightly higher than the 
“validity” coefficient for a given length of the test, then an in- 
crease in length will raise the validity coefficient almost as rapidly 
as it raises the reliability coefficient. Hence, one technique may 
be better than another if a short test is desired, but worse than 
another if a long test is desired. Strictly, therefore, experimental
comparisons* of testing techniques should be designed to answer
the question, “Which technique will yield the highest validity in a
given time period?” rather than the insolvable question, “Which
technique is in general the most valid for measuring a certain
outcome?”

* See “Experimental Procedures in Test Evaluation,” Lindquist, E. F., and Cook,
Walter W., Journal of Experimental Education, March, 1933 (Vol. I, p. 163).

It is important to note, in the preceding example, that since the
experimental administration times were arbitrarily determined,
there is a possibility that more time than necessary was allotted
to the multiple-choice form, and that the latter form was therefore
unfairly penalized in the comparison. This factor could be
controlled experimentally if the purpose of the experiment were to
determine which form of test yields the more valid score for a given
time limit. One could then build several right-wrong tests
containing different amounts of homogeneous material, administer all
tests under the given time limit, find which correlates most highly
with the criterion, and thus determine empirically the optimum
amount of material for the given time limit and the given
technique. If the same were done for the multiple-choice test, one
could then fairly compare the validity coefficients for the given time
limit, both techniques having been applied to the optimum amount
of material for that time limit.

It should be apparent, from formula (37), that while the
correlation of a test with a certain criterion may be increased by
increasing the length of the test there is an upper limit to the
correlation which may thus be secured, and this limit is set by the
reliability of the original test. If we let $n$ in (37) approach infinity,
$r_{lc}$ approaches the limit (see also the last paragraph on page 230)

\[ r_{\infty c} = \frac{r_{sc}}{\sqrt{r_{ss}}} \qquad (38) \]

In our example, for instance, the limiting value of the correlation
with the criterion for a lengthened test of the right-wrong type
would be estimated as

\[ r_{\infty c} = \frac{.70}{\sqrt{.72}} = .82 \]

while that for a lengthened multiple-choice test would be .77. In
other words, the estimated correlations between obtained scores
on the criterion test and true* scores on the experimental tests are
.82 and .77 respectively.
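
The approach of formula (37) to the limit (38) can be seen numerically in the following sketch, which reuses the fictitious values of the spelling example. The function names are illustrative only.

```python
import math

def lengthened_validity(r_sc, r_ss, n):
    """Formula (37): criterion correlation of a test n times as long."""
    return r_sc / math.sqrt((1.0 - r_ss) / n + r_ss)

def limiting_validity(r_sc, r_ss):
    """Formula (38): the value approached by (37) as n grows without bound."""
    return r_sc / math.sqrt(r_ss)

limit_right_wrong = limiting_validity(0.70, 0.72)      # about .82
limit_multiple_choice = limiting_validity(0.75, 0.95)  # about .77

# Successive lengthenings of the right-wrong test creep up toward the limit.
for n in (1, 4, 16, 64, 1024):
    print(n, round(lengthened_validity(0.70, 0.72, n), 4))
print("limit", round(limit_right_wrong, 4))
```

Most of the gain comes from the first few multiples of length; beyond that the estimated correlation is already close to its ceiling.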

* True scores are perfectly reliable measures of whatever a test happens to be
measuring. The true score of an individual on a given test may be defined as his
mean score on an infinite number of equivalent forms of the test.

Formula (38) can of course be used also to estimate the
correlation between a fallible test and an infallible criterion, by
interchanging s and c in the formula. That is, one would divide the
correlation $r_{sc}$ by the square root of the reliability of the criterion,
rather than of the fallible test. This form of (38) is of much greater
practical interest than the form given above, since it provides a
measure of the validity of the test which corrects for errors in the
criterion and at the same time takes into consideration the errors
which would characterize the test in actual use.

A means of estimating the correlation between true scores on
both the criterion and the experimental test will be presented in 
the following section. 


8. CORRELATION COEFFICIENTS CORRECTED FOR ATTENUATION 

Correlations between scores on educational and psychological
tests are systematically lowered (attenuated) as the result of errors
of measurement. That is, the correlation between obtained scores
on any two fallible tests will be lower than the correlation between
true scores on the same tests, due to the fact that the obtained scores
are in part the result of chance (and hence uncorrelated) errors of
measurement. Spearman* has shown that the correlation
between true scores on two tests for any group may be estimated by
means of the formula

\[ r_{\infty\infty} = \frac{r_{12}}{\sqrt{r_{11}\, r_{22}}} \qquad (39) \]

in which $r_{12}$ is the obtained correlation between tests 1 and 2,
whose reliability coefficients for the given group are $r_{11}$ and $r_{22}$,
respectively.

* Spearman, C., “Demonstration of Formulae for True Measurement of
Correlation,” American Journal of Psychology, Volume 18 (1907), p. 161.

Suppose, for example, that for a certain group of pupils the
correlation between scores on two given spelling tests is .74, and that
the reliability coefficients of these tests for the given group are
.97 and .84, respectively. According to the “correction for
attenuation” formula, the estimated correlation between true scores for
this group is

\[ r_{\infty\infty} = \frac{.74}{\sqrt{.97 \times .84}} = .82 \]

This signifies that if we added homogeneous material to each of
these tests to increase its reliability, we could not expect the
correlation between the scores to exceed .82, no matter how long we
made the tests. In other words, a correlation of .82 would
presumably be found between scores on infinitely long and hence
perfectly reliable tests of these types. This of course implies that
whatever is being measured by one of these tests is not exactly the 
same as that being measured by the other. If they were measuring
the same thing the corrected coefficient should be 1.00, since
the correlation between perfectly reliable measures of the same 
thing must obviously be unity. If, then, the scores on the first of 
these tests were considered as criterion measures of spelling ability, 
the "corrected" coefficient of .82 would mean that the second type 
of spelling test is in part measuring irrelevant factors, and could 
not possibly be highly valid, no matter how long the test were 
made. However, even though the criterion is beyond question, 
this should not be taken as evidence that this particular test (i.e., in
its original length) is not a “good” measure of spelling ability for
a test of its length, or that it is necessarily inferior to another par- 
ticular test whose “corrected” validity coefficient is above .82 (see 
next to last paragraph of the preceding section). Suppose, for in- 
stance, that the scores on a third spelling test showed a correlation 
of .65 with the criterion scores for the same pupils, and that this 
correlation when corrected for attenuation was raised to .88. This 
would signify that if both tests were made infinitely long, the sec- 
ond type of test would be inferior in validity, but if the original 
tests were to be used as found, the second type of test would never- 
theless be the more valid. 
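
The correction for attenuation in the spelling example can be checked with the short sketch below. The function name is illustrative, and the second set of values is hypothetical, chosen only to show that the corrected coefficient is not bounded by 1.00 when the obtained coefficients contain sampling error.

```python
import math

def correct_for_attenuation(r_12, r_11, r_22):
    """Formula (39): estimated correlation between true scores on two tests."""
    return r_12 / math.sqrt(r_11 * r_22)

# Values from the spelling example in the text.
corrected = correct_for_attenuation(0.74, 0.97, 0.84)
print(round(corrected, 2))  # 0.82

# Hypothetical values (not from the text): an obtained correlation that is
# high relative to the two reliabilities gives a "corrected" value above 1.
impossible = correct_for_attenuation(0.80, 0.70, 0.85)
print(round(impossible, 3))  # 1.037
```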

The mistake has frequently been made of interpreting a correlation 
coefficient corrected for attenuation as the “true” correlation between
the traits which the tests are supposed to measure, rather than as the 
estimated correlation between perfectly reliable measures of whatever 
the tests actually do measure. This is but one of the many instances 
of the so-called “jingle fallacy” in interpreting test results, that is,
the fallacy of dealing with test scores as if the tests were really 
measuring what their names or titles imply that they are measur- 
ing. It should be noted, therefore, that the “errors of measure- 
ment” whose effects upon correlations are presumably eliminated 
by the correction for attenuation are only those which are due to the 
lack of perfect reliability. If the tests are measuring the wrong
things, a correction for attenuation will only indicate the correla- 
tion between true measures of these “wrong” things. 

It should be noted also that the correction-for-attenuation
formula is based upon an assumption that in some instances may be
quite doubtful. This is the assumption that the errors of
measurement in the two tests are unrelated. Furthermore, the
“corrected” coefficient is subject to whatever sampling errors* are
present in the obtained correlation and reliability coefficients. (As
a result of these errors, and of errors due to the way in which the
reliability coefficients are estimated, the corrected coefficient may
sometimes exceed 1.00.) Correlations corrected for attenuation
should therefore always be considered as only approximate,
but if so considered may nevertheless prove highly useful.


* See Cureton, E. E., “On Certain Estimated Correlation Functions and Their
Standard Errors,” Journal of Experimental Education, Vol. 4 (March, 1936),
pp. 252-263, formula (23).


9. TEST FOR LINEARITY OF REGRESSION

Sometimes, after having constructed a correlation diagram to
compute the coefficient of correlation between two variables, we
may note an apparent tendency for the means of the columns (or
rows) to fall along a curved rather than a straight line. In other
words, the distribution of tally marks may suggest that the
relationship is non-linear. If it is, then of course the product-moment
correlation coefficient will underestimate the degree of relationship
actually present, and estimates based on the linear regression
equation may be seriously biased. Before using these latter techniques
on the hypothesis of linearity, we should, in cases of doubt, satisfy
ourselves that this hypothesis is tenable. That is, we should
demonstrate that the observed deviations of the column (or row) means
from a straight line pattern could reasonably be considered as due
only to chance fluctuation in random sampling.

The hypothesis of linear regression may be readily tested by the
methods of analysis of variance and covariance. Suppose that the
scales on our correlation chart are labeled the X and Y scales in the
customary fashion, that y represents the deviation of any Y-measure
from the mean of all the Y's, and that x has a similar meaning.
Let us first consider the possibility that the column means deviate
significantly from a straight line pattern.

The first step in the procedure would be to analyze the total
"sum of squares" ($ss_y = \sum y^2$) for the Y-distribution into its
between columns ($ss_c$) and within columns ($ss_w$) components. This
would be done by the methods of Section 2 of Chapter V, viewing
the correlation table just as we viewed Table 6 on page 93.
Accordingly,

\[ ss_w = ss_y - ss_c = \sum y^2 - ss_c \]

The next step would be to compute the total "sum of products,"
$\sum xy$, by the method described on page 185. We may now note
that the sum of the squared deviations of the Y-measures in the
individual columns from the regression line ($y = bx$) is given by
(see page 100)

\[ \sum (y - bx)^2 = \sum y^2 - \frac{(\sum xy)^2}{\sum x^2} \]

in which $\sum x^2$ is of course the total sum of squares for the
X-distribution. Now, if the observed means had fallen exactly on the
straight regression line, then the sum of squared deviations from
the regression line would be the same as the sum of squares within
columns. That is, $ss_w$ would equal $\sum (y - bx)^2$. If this were true,
then by the two equations above, $ss_c$ would have to equal
$(\sum xy)^2 / \sum x^2$. Since all column means do not fall on the
regression line, $ss_c$ will be larger than $(\sum xy)^2 / \sum x^2$, and the
difference between these terms will be indicative of the amount of
departure from linearity. We may think, then, of the sum of squares
between columns as consisting of two components, one of which,
$(\sum xy)^2 / \sum x^2$, is due to linear regression, and the other of
which, $ss_c - (\sum xy)^2 / \sum x^2$, is due to departure
from linearity. The component due to linear regression has one
degree of freedom, that due to departure from linear regression
therefore has one less degree of freedom than the sum of squares
between columns. If the departure from linearity is due only to
chance, then the variance estimated from the sum of squares due to
departure from linearity should be the same as the variance within
columns. If the F-test shows a significant difference between
these variances, we may take this as evidence that the regression
is non-linear.

Suppose, for example, that we have a correlation chart in which
15 columns contain frequencies, and that 500 cases are involved.
Suppose the analysis of the Y-distribution is

                                       Sum of
                             d.f.      Squares      Variance

Between columns               14      14,235.22
Within columns               485       5,126.30       10.57

Total                        499      19,361.52

Suppose also that $(\sum xy)^2 / \sum x^2 = 13{,}782.72$. The variance
between columns could then be analyzed as follows:

                                                  Sum of
                                         d.f.     Squares      Variance

Total (between columns)                   14     14,235.22
Due to linear regression (subtract)        1     13,782.72

Due to departure from linearity           13        452.50       34.81

The ratio between these variances is

\[ F = \frac{34.81}{10.57} = 3.29 \]
For 13 and 485 d.f., an F need only exceed about 2.17 to be
significant at the 1 per cent level, hence in this case we may
confidently reject the hypothesis of linear regression in the population
sampled. However, if we wish to show, not that the hypothesis 
of linear regression is tenable or otherwise, but rather that curvi- 
linear regression is more probable than linear, we would be inter- 
ested in an F at a much lower level of significance. 

The test for linearity as applied to the row means would of 
course be exactly like that just described. 

It should be evident that with a small sample a greater absolute 
divergence from linearity is necessary for significance than with a
large sample. That is, the "significance" of the departure from 
linearity is not a measure of the degree of curvilinearity. 
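
The arithmetic of the linearity test in this section can be laid out as a minimal sketch. It works from the sums of squares given in the example rather than from raw scores; the function and argument names are illustrative, and the critical value of F must still be read from a table.

```python
def linearity_f_test(ss_between, ss_within, ss_linear, n_columns, n_cases):
    """F ratio for departure from linear regression of column means.

    ss_between: sum of squares between columns
    ss_within:  sum of squares within columns
    ss_linear:  (sum xy)^2 / (sum x^2), the part due to linear regression
    """
    df_departure = n_columns - 2       # (n_columns - 1) less 1 for the line
    df_within = n_cases - n_columns
    ss_departure = ss_between - ss_linear
    var_departure = ss_departure / df_departure
    var_within = ss_within / df_within
    f_ratio = var_departure / var_within
    return f_ratio, df_departure, df_within, var_departure, var_within

# Figures from the example: 15 columns, 500 cases.
f_ratio, df1, df2, var_dep, var_w = linearity_f_test(
    ss_between=14235.22, ss_within=5126.30, ss_linear=13782.72,
    n_columns=15, n_cases=500,
)
print(df1, df2)            # 13 485
print(round(var_dep, 2))   # 34.81
print(round(var_w, 2))     # 10.57
print(round(f_ratio, 2))   # 3.29
```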


10. TEST FOR SIGNIFICANCE OF A NON-LINEAR RELATIONSHIP

We noted in Section 2 of this chapter that the test of significance
there described assumed that the relationship, if any, is linear,
and that we may have even a marked degree of non-linear
relationship although the observed product-moment r proved not to be
significant. In such instances, if there is any reason to suspect a
curvilinear relationship, we might wish to apply a more inclusive
test of independence. This test may be very simply described in
terms of analysis of variance. It consists only of analyzing the
variance of the total Y (or X) distribution into its between columns (or
rows) and within columns (or rows) components. If the variance
for between columns (or rows) differs significantly (by the F-test)
from the variance for within columns (or within rows), we have
evidence that there is some relationship between X and Y, or we
have disproved the hypothesis of independence. This test is
superior to that of Section 2, in that it provides for the possibilities
either of linear or non-linear regression.
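
A small sketch may help show why this test is more inclusive than the test of r. The data below are hypothetical (three columns of three scores each, arranged so that the column means rise and then fall); the function names are illustrative only.

```python
import math

def between_within_f(columns):
    """F ratio of the between-columns to the within-columns variance of Y."""
    all_y = [y for col in columns for y in col]
    grand_mean = sum(all_y) / len(all_y)
    ss_between = sum(len(col) * (sum(col) / len(col) - grand_mean) ** 2
                     for col in columns)
    ss_within = sum((y - sum(col) / len(col)) ** 2
                    for col in columns for y in col)
    df_between = len(columns) - 1
    df_within = len(all_y) - len(columns)
    return (ss_between / df_between) / (ss_within / df_within)

def product_moment_r(xs, ys):
    """Ordinary product-moment correlation between paired scores."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical chart: column index is the X value, entries are Y scores.
columns = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
xs = [i for i, col in enumerate(columns) for _ in col]
ys = [y for col in columns for y in col]

r = product_moment_r(xs, ys)
f_ratio = between_within_f(columns)
print(round(r, 6))        # 0.0
print(round(f_ratio, 6))  # 9.0
```

Here r is zero because the rise and fall of the column means cancel, while the F ratio (with 2 and 6 degrees of freedom) still reflects the differences among those means.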

It should be noted that a significant departure from linearity 
does not necessarily mean that the relationship follows any simple 
curve. It may be that, even though this test indicated a marked 
departure from linearity, a straight line would still be the best to 
use in predicting one variable from the other (Fisher gives an instance
of this in § 44 of his Statistical Methods). The problem of
fitting curved regression lines is beyond the scope of this book, but 
it may be worth observing that correlations computed from such 
regression lines may frequently prove worth while in educational 
research. The student interested in this problem should read F. C. 
Mills, Statistical Methods, pp. 432-441. 


11. THE CORRELATION RATIO

If, in a study of the relationship between two variables, we have
shown that the relationship is curvilinear, we may wish to have
some index of the degree of relationship present. The measure
used for this purpose is known as the correlation ratio, and is
represented by the Greek letter η (eta). The nature of this ratio may
perhaps best be shown in terms of its relationship to the
product-moment r.

The standard error of estimating one variable from known values
of a linearly related variable by means of the linear regression
equation is given by

$$\sigma_{y.x} = \sigma_y \sqrt{1 - r_{xy}^2}$$

If we solve for $r_{xy}$ in this formula, we have

$$r_{xy} = \sqrt{1 - \frac{\sigma_{y.x}^2}{\sigma_y^2}}$$

We thus see that the value of $r_{xy}$ is dependent upon the ratio of
$\sigma_{y.x}^2$, which is the variance about the regression line within the
columns of the correlation diagram, and $\sigma_y^2$, which is the variance
of the total Y-distribution. This suggests a similar measure of
relationship when the regression is non-linear. We can define $\eta_{yx}$ as

$$\eta_{yx} = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}} = \sqrt{\frac{\sigma_y^2 - \sigma_{ay}^2}{\sigma_y^2}} \tag{40}$$

in which $\sigma_{ay}^2$ now represents the variance within columns about the
means of the columns, rather than about the straight regression
line. A similar ratio may be used to measure the relative concentration
of the measures in the rows about the row means. This is

$$\eta_{xy} = \sqrt{\frac{\sigma_x^2 - \sigma_{ax}^2}{\sigma_x^2}} \tag{41}$$

These ratios may be readily defined in terms of analysis of variance.
The square of the ratio $\eta_{yx}$ is simply the ratio (for large samples)
of the “sum of squares” between columns ($ss_c$) and the total sum of
squares ($ss_y$) for the Y-distribution. That is,

$$\eta_{yx}^2 = \frac{ss_c}{ss_y}$$

and similarly,

$$\eta_{xy}^2 = \frac{ss_r}{ss_x}$$

in which $ss_r$ represents the sum of squares between rows and $ss_x$ the
total sum of squares for the X-distribution. These sums of squares
may be computed by the method of Section 2 of Chapter V.
The final step in the computation of a correlation ratio may be
illustrated with the data in the example on page 237. In this
example, $ss_c$ = 14235.22, and $ss_y$ = 19361.52, hence,

$$\eta_{yx} = \sqrt{\frac{14235.22}{19361.52}} = \sqrt{.7352} = .86$$


It should be clear from (40) that if there is no relationship, then
$\sigma_{ay}$ will equal $\sigma_y$, and $\eta$ will be zero. If the relationship is perfect,
$\sigma_{ay}$ will equal zero, and $\eta$ will equal unity. Thus $\eta_{yx}$ (or $\eta_{xy}$) may
have any value from 0 to 1.00, but cannot take negative values,
as does $r$. It should be observed that the sum of squares between
columns is somewhat dependent upon the number of columns,
which is arbitrarily determined. If so small an interval is used
that no column contains more than one measure, then the sum of
squares between columns will be the same as the total sum of squares,
and $\eta$ will equal 1.00, regardless of the degree of relationship. It is
best, therefore, to use a relatively coarse grouping in computing
correlation ratios, so that the column (or row) means will become
relatively stable.

Since the variance of the measures in the columns about a
straight regression line can never be less than their variance about
the observed means of the columns, it follows that $\eta_{yx}$ (or $\eta_{xy}$) can
never be smaller than $r_{xy}$. It is for this reason that we have said
that $r_{xy}$ underestimates the degree of relationship when the
regression is non-linear.
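
The analysis-of-variance definition of the correlation ratio can be sketched in a few lines of Python. This is only an illustration: the function names and the small data set are invented here and do not come from the text, and the data were deliberately chosen to be markedly curvilinear.

```python
from math import sqrt


def correlation_ratio(columns):
    """Return eta_yx for Y-measures grouped by column (X-interval).

    eta^2 = (sum of squares between columns) / (total sum of squares).
    """
    cols = [c for c in columns if c]              # ignore empty columns
    all_y = [y for c in cols for y in c]
    grand_mean = sum(all_y) / len(all_y)
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    ss_between = sum(len(c) * (sum(c) / len(c) - grand_mean) ** 2 for c in cols)
    return sqrt(ss_between / ss_total)


def product_moment_r(xs, ys):
    """Ordinary product-moment correlation, written out for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: five columns (X midpoints 1..5) with an inverted-U trend.
x_midpoints = [1, 2, 3, 4, 5]
columns = [[2, 3, 4], [6, 7, 8], [9, 10, 11], [7, 8, 6], [3, 2, 4]]

eta = correlation_ratio(columns)
xs = [x for x, col in zip(x_midpoints, columns) for _ in col]
ys = [y for col in columns for y in col]
r = product_moment_r(xs, ys)

print(f"eta_yx = {eta:.4f}, r = {r:.4f}")
```

For these made-up data the product-moment r is zero while the correlation ratio is about .96, which illustrates why r understates a non-linear relationship. Passing columns that each hold a single measure returns 1.00 regardless of the data, which is the grouping artifact described above.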


12. BISERIAL CORRELATION
There are frequent occasions, in educational research, to compute
a coefficient of correlation between a continuous variable and
one which can be considered as dichotomous, that is, one which can
be classed in only two categories. For example, in evaluating the
individual items in an objective examination, one might wish to
compute the correlation between the scores on a criterion test and
the responses (right or wrong) to a single test item. Again, one
might wish to measure the relationship between scores on an
intelligence test and the responses (yes or no) to a questionnaire item.
In such instances, the dichotomous variable may sometimes be
considered as essentially continuous, and as capable of being classed
in finer intervals if other methods of measurement were employed.
For example, one might reason that in a given group of pupils
there are many degrees of understanding of the same test item, and
that the right-wrong method of scoring merely represents an
arbitrary imposition of a dichotomy upon a continuous distribution of
degrees of understanding. That is, the practice of classing the
responses as either right or wrong may be considered as analogous
to describing performances on a test as either “passing” or “failing”
instead of in terms of scores. If the dichotomous variable is
of this character one may, on the assumption that continuous
measures of this variable would be normally distributed, validly
compute the correlation by the biserial method.

The biserial correlation coefficient ($r_{bis}$) may be defined in terms
of either of the following formulas:

$$r_{bis} = \frac{M_p - M_q}{\sigma} \cdot \frac{pq}{z} \tag{42}$$

$$r_{bis} = \frac{M_p - M_t}{\sigma} \cdot \frac{p}{z} \tag{43}$$

In these formulae, $M_p$ represents the mean score (on the continuous
variable) for the individuals in the first category (of the
dichotomous variable), $M_q$ represents the mean score for the individuals
in the second category, $M_t$ the mean score for the entire group, $\sigma$
the standard deviation of scores for the entire group, $p$ the
proportion of the entire group in the first category, $q$ the proportion in
the second category ($p + q = 1$), and $z$ the height of the ordinate
at the point on the normal curve (of unit area and unit standard
deviation) which divides the area under the curve into the
proportions $p$ and $q$. Formula (43) is the more convenient to use
when a number of biserial coefficients are to be computed against
the same continuous variable, as when the correlation with a single
criterion is to be computed for the responses to each item in a
test.

To illustrate the computational procedure, suppose that the
mean and standard deviation of scores on a criterion test are 56.0
and 12.8 respectively for a given group of 200 pupils, and that 64
of these pupils have responded correctly and 136 incorrectly to a
certain test item. The problem is to compute $r_{bis}$ between the
criterion scores and the responses to this item. To compute this
correlation by means of (43), we must know what mean score on the
criterion test was made by the 64 pupils that answered the item
correctly. Suppose that this mean score is $M_p$ = 66.4. In this
case $p$ = .32, and we must therefore find the height ($z$) of the
normal curve at the point above (or below) which 32 per cent of the
area lies, after which we must find $p/z$. The values of $p/z$ for
various values of $p$ are given in Table 16. From this table we
find that for $p$ = .32, $p/z$ = .8948. Hence for our example

$$r_{bis} = \frac{66.4 - 56.0}{12.8} \times .8948 = .727$$
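
The computation by formula (43) can be checked with a short Python sketch. The function names are invented for this illustration; the sketch uses the standard library's `NormalDist` to obtain the ordinate z directly, so that p/z need not be read from a table.

```python
from statistics import NormalDist


def p_over_z(p):
    """Return p/z, where z is the height of the unit normal curve at the
    point above which the proportion p of the area lies."""
    unit_normal = NormalDist()            # mean 0, standard deviation 1
    x = unit_normal.inv_cdf(1.0 - p)      # dividing point on the base line
    z = unit_normal.pdf(x)                # ordinate at that point
    return p / z


def biserial_r(mean_first, mean_total, sd_total, p):
    """Biserial r in the form (M_p - M_t) / sigma * (p / z)."""
    return (mean_first - mean_total) / sd_total * p_over_z(p)


# Figures from the worked example: 64 of 200 pupils answered correctly.
p = 64 / 200
ratio = p_over_z(p)
r_bis = biserial_r(mean_first=66.4, mean_total=56.0, sd_total=12.8, p=p)

print(f"p/z = {ratio:.4f}, r_bis = {r_bis:.3f}")
```

The sketch gives p/z of about .8948 and a biserial r of about .727 for the figures in the example.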


The biserial correlation coefficient has been quite widely used in
educational research as an “index of validity” or “index of
discrimination” of individual test items. There are, however, a
number of other indices of discriminating power of a test item that may
be much more readily computed and that may prove quite adequate
for most practical purposes.

TABLE 16
VALUES OF p/z FOR VARIOUS VALUES OF p (NORMAL CURVE OF UNIT
AREA AND UNIT S.D.)

 p      p/z      p      p/z      p      p/z      p      p/z      p      p/z
.01             .21             .41             .61             .81   2.9849
.02             .22             .42             .62   1.6283    .82   3.1250
.03    .4409    .23    .7575    .43             .63   1.6686    .83
.04             .24             .44             .64   1.7107    .84   3.4524
.05             .25             .45             .65   1.7549    .85   3.6456
.06             .26             .46             .66   1.8013    .86   3.8638
.07             .27             .47             .67   1.8501    .87   4.1126
.08    .5382    .28    .8318    .48   1.2047    .68   1.9015    .88   4.3991
.09             .29             .49             .69   1.9558    .89   4.7331
.10             .30             .50             .70   2.0133    .90   5.1283
.11             .31             .51             .71   2.0742    .91   5.6038
.12             .32             .52             .72   2.1389    .92
.13    .6145    .33    .9112    .53   1.3323    .73   2.2078    .93
.14             .34             .54             .74             .94
.15             .35             .55             .75   2.3601    .95   9.2111
.16    .6576    .36    .9623    .56             .76   2.4447    .96  11.1403
.17    .6718    .37             .57             .77   2.5358    .97
.18             .38             .58   1.4838    .78   2.6343    .98
.19             .39             .59             .79             .99
.20             .40             .60             .80

(Values of p/z for values of p not given may be found by linear interpolation; more complete tables are
available elsewhere, as in Workbook in Statistical Method, Jack W. Dunlap, p. 140.)

The standard error of biserial r is given approximately by

$$\sigma_{r_{bis}} = \frac{\dfrac{\sqrt{pq}}{z} - r_{bis}^2}{\sqrt{N}}$$

This formula is limited in usefulness by the fact that the form of
the sampling distribution is not known, and that it is decidedly
inaccurate for low values of $p$ or $r$.


13. TETRACHORIC CORRELATION
When both of the variables to be correlated are dichotomous, as
when one wishes to compute the correlation between responses
(right or wrong) to two test items, the tetrachoric method may be
employed. This method involves assumptions similar to that
made in computing a biserial r. That is, it is assumed that both
variables are essentially continuous, and that each would be
normally distributed for the population involved if measured in
sufficiently fine intervals.

The correlation table for two dichotomous variables is a 2 × 2
contingency table, as illustrated in the example of the following
paragraphs. We shall let $a$ represent the observed frequency in
the upper left-hand cell, $b$ that in the upper right-hand cell, $c$
that in the lower left-hand and $d$ that in the lower right-hand cell.
(The table is usually so arranged that the right-hand column
represents the “superior” category of one variable and the upper row
the “superior” category of the other.) The total number of cases
is represented by $N$, and $p$ and $q$ represent the proportions of the
entire group in the “superior” and “inferior” categories,
respectively, of one variable, such that $p = \frac{a+b}{N}$ and $q = \frac{c+d}{N}$. The
height of the ordinate which divides the area under the normal
curve (of unit area and S.D.) into the proportions $p$ and $q$ is then
represented by $z$, while $x$ represents the distance (in $\sigma$ units) of
this ordinate from the mean ($x$ is negative if $p$ > .5 and positive
if $p$ < .5). For the second variable, $p'$, $q'$, $z'$, and $x'$ have similar
meanings ($p' = \frac{b+d}{N}$ and $q' = \frac{a+c}{N}$). A close approximation
to the tetrachoric correlation coefficient may then be computed by
solving for $r_t$ in the equation ¹

$$\frac{bc - ad}{N^2 z z'} = r_t + \frac{x x'}{2} r_t^2 \tag{44}$$

Upon substitution of the known values in this equation, the result
may be simplified and expressed in the standard form of the
quadratic equation. This standard form is $ax^2 + bx + c = 0$
[the symbols in this standard form do not have the same meaning
as those used in (44)] and the roots of this equation may be
computed from

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

¹ The complete formula for computing $r_t$ involves an infinite series in the right-hand
side of the equation, but a fairly close approximation may be secured by dropping
all but the first two terms of this series, as has been done in (44). See Pearson,
Karl, “On the Correlation of Characters Not Quantitatively Measurable,”
Philosophical Transactions, Royal Society of London, Series A, 1900, 195, pp. 1-47.
A formula for the standard error of $r_t$ may be found in Kelley, T. L., Statistical
Methods, p. 257.



To illustrate the computation of a tetrachoric r by the direct
method, let us suppose that for a group of 150 pupils, the
contingency table representing the relationship between responses
(right or wrong) on two test items is as follows:

                       Item #1
                   Wrong    Right    Total
Item #2   Right      24       56       80
          Wrong      36       34       70
          Total      60       90      150

For this contingency table, $a$ = 24, $b$ = 56, $c$ = 36 and $d$ = 34.
Hence $p$ = .5333, $q$ = .4667, $p'$ = .60, and $q'$ = .40. The values
of $z$, $z'$, $x$, and $x'$ may be read from the table on pages 14 ff. in
The Kelley Statistical Tables (Macmillan Co., 1938) for the given
values of $p$ and $p'$. According to this table, $z$ = .3976, $x$ =
-.0828, $z'$ = .3863, and $x'$ = -.2533. Hence the equation for
computing $r_t$ becomes

$$\frac{(56)(36) - (24)(34)}{150^2 (.3976)(.3863)} = r_t + \frac{(-.0828)(-.2533)}{2} r_t^2$$

which, in the standard form, is

$$.010487\, r_t^2 + r_t - .3472 = 0$$

Hence

$$r_t = \frac{-1 \pm \sqrt{(1)^2 - 4(.010487)(-.3472)}}{(2)(.010487)}$$

One root of the quadratic equation is .35; the other is negative and
exceeds 1.00 in absolute value, and hence is impossible. The tetrachoric
correlation coefficient between responses to these items is thus .35.
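
The direct method can be sketched in Python as follows. This is a minimal illustration of the two-term approximation (44) only, not of the complete series; the function name is invented here, and the normal-curve values are computed with the standard library rather than read from Kelley's tables, so the last decimal places differ slightly from those in the example.

```python
from math import sqrt
from statistics import NormalDist


def tetrachoric_approx(a, b, c, d):
    """Two-term approximation to tetrachoric r for a 2 x 2 table.

    a, b are the upper-left and upper-right frequencies; c, d are the
    lower-left and lower-right frequencies.
    """
    n = a + b + c + d
    p = (a + b) / n                       # proportion in the upper row
    p_prime = (b + d) / n                 # proportion in the right-hand column
    unit_normal = NormalDist()
    x = unit_normal.inv_cdf(1.0 - p)      # negative when p > .5
    x_prime = unit_normal.inv_cdf(1.0 - p_prime)
    z = unit_normal.pdf(x)
    z_prime = unit_normal.pdf(x_prime)

    lhs = (b * c - a * d) / (n ** 2 * z * z_prime)
    quad = x * x_prime / 2.0              # coefficient of r_t squared
    if abs(quad) < 1e-12:                 # equation reduces to r_t = lhs
        return lhs

    # Solve quad * r^2 + r - lhs = 0 and keep the admissible root.
    disc = 1.0 + 4.0 * quad * lhs
    if disc < 0:
        raise ValueError("no real root for this table")
    roots = [(-1.0 + sign * sqrt(disc)) / (2.0 * quad) for sign in (1.0, -1.0)]
    admissible = [r for r in roots if -1.0 <= r <= 1.0]
    if not admissible:
        raise ValueError("no root between -1 and +1")
    return min(admissible, key=abs)


r_t = tetrachoric_approx(a=24, b=56, c=36, d=34)
print(f"r_t = {r_t:.3f}")
```

For the table in the example the sketch returns a value of about .35, and the second root of the quadratic lies far outside the range from -1 to +1.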

This direct method of computing $r_t$ is so complex and
time-consuming that, if no easier method of computation were available,
$r_t$ would perhaps be very rarely used. However, the computation
of $r_t$ has been made extremely easy by the computing diagrams *
prepared by L. L. Thurstone and others. The way in which these
diagrams may be used is fully explained in the preface of the
volume in which they are presented. In terms of the notation here
used, one need only compute $p$ and $q'$, together with the relative
frequency in one cell of the contingency table. The appropriate
diagram is entered with these three values, and the value of $r_t$ is
then read directly from the diagram.

When Thurstone's computing diagrams are used, $r_t$ is perhaps
the easiest to compute of all correlation coefficients. For this
reason, it has sometimes been used instead of the product-moment
r when both variables are continuous, by arbitrarily imposing
dichotomies on both distributions. For example, one might
characterize a test performance as “above” or “below” the median, or
as “above” or “below” any selected point on the scale, instead of
in terms of a score. This procedure is not recommended except
when only approximate values of r are desired (and when both
distributions are approximately normal in form and the relationship
is linear). Obviously, $r_t$ may similarly be used in place of $r_{bis}$, as a
convenient way of computing indices of discrimination for
individual test items. In general, $r_t$ is not a satisfactory measure of
relationship for small samples, or for contingency tables in which
any of the values $p$, $q$, $p'$, or $q'$ are less than .05.


* Chesire, L., Saffir, M., and Thurstone, L. L., Computing Diagrams for the Tetrachoric
Correlation Coefficient, University of Chicago Bookstore, 1933.


14. THE RANK CORRELATION COEFFICIENT
Sometimes it is possible to describe an individual's status in a
group only in terms of rank position. For example, one might
subjectively arrange the pupils in a given group in order of their
“honesty,” or in order of their “social adaptability,” or in order of
their “leadership qualities,” etc. If a quantitative measure of
relationship is desired for qualities thus described, the rank
correlation method may be employed.
The rank correlation coefficient ($\rho$) is based on the difference
($D$) in the two ranks for each individual. The formula for $\rho$ is

$$\rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \sum D^2}{(N - 1)\,N\,(N + 1)} \tag{45}$$

It should be noted that in every set of three consecutive numbers
one is always divisible by 3 and one by 2, so that the 6 in the
numerator of the last term may always be cancelled in the
computation. To illustrate the computation of the rank correlation
coefficient, suppose that the pupils in a group of 20 are ranked in
two traits in the order indicated in the following table.


Pupil    Difference (D) in Ranks
  1                4
  2                4
  3                4
  4                7
  5                9
  6                5
  7                5
  8                7
  9                4
 10                6
 11                2
 12                3
 13                7
 14                1
 15                8
 16                2
 17                7
 18                9
 19                6
 20                4
From this table,

$$\rho = 1 - \frac{6 \sum D^2}{(N - 1)\,N\,(N + 1)} = 1 - \frac{6 \times 642}{19 \times 20 \times 21} = 1 - \frac{642}{1330} = .52$$

The rank correlation coefficient should not be considered as
equivalent to a product-moment r, although if quantitative measures of
the traits ranked are normally distributed, $\rho$ and $r$ are very nearly
equal in value for large samples. In general, $\rho$ should be considered
as only an approximate measure of relationship which should be
employed only when rank-position is the only available means of
describing individual status. A rough test of the significance of
the observed $\rho$ may be secured by comparing it with its standard
error as given by the formula

$$\sigma_\rho = \frac{1.05(1 - \rho^2)}{\sqrt{N}} \tag{46}$$

A good deal of work on the sampling distribution of $\rho$ has been
done recently, and the student interested in more exact tests of
significance should see the article by Hotelling and Pabst in the
January, 1936, number of Annals of Mathematical Statistics and
that by M. G. Kendall et al. in Biometrika, 1939, pp. 251 ff.

It is well worth noting here that data scored in ranks can be
satisfactorily treated for many purposes by transforming them
into equivalent normal deviates. Fisher has provided a table for
facilitating these transformations (Tables XX and XXI in Fisher
and Yates, Statistical Tables) and notes that analysis of variance,
correlations, etc., can be carried out with these transformed measures.


15. PARTIAL CORRELATION 

The correlation between two variables for a given sample is
very frequently influenced by the presence of a third variable, or of
a number of other variables. For example, if we were to take a
large sample of public school children of all ages from 6 to 16, we
would find a significant positive correlation between measures of
height and spelling ability for this sample. This of course would
not mean that the taller pupils are better spellers because they are
taller, nor that they are taller because they are better spellers. 
Rather, the observed correlation may be readily explained by the 
fact that both height and spelling ability increase with age, and 
hence the older pupils are both taller than the younger pupils 
and better spellers than the younger pupils. The fact that there 
is little or no intrinsic relationship between height and spelling 
ability may be readily demonstrated by computing the correlation 
between measures of these traits for a group of pupils all of whom 
are of the same age, in which case no significant correlation would 
be found (unless due to still other unspecified extraneous variables). 

When the relationship between two variables for a given group
is influenced by the presence of a third variable, a measure of the
relationship with this influence removed may (if the relationships
are linear) be computed by the technique of partial correlation.
The partial correlation between variables 1 and 2, with variable 3
“taken out,” is given by the formula

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \tag{47}$$

This coefficient $r_{12.3}$ (read “r sub one two point three”) is a partial
correlation coefficient of the first order. The coefficients $r_{12}$, $r_{13}$,
and $r_{23}$ are zero order coefficients. Similar formulas may be
written for $r_{13.2}$ and $r_{23.1}$. For example, suppose that for a given
sample of 85 school pupils, variable 1 is a measure of chronological
age, variable 2 a measure of height, and 3 a measure of spelling
ability. Suppose also that $r_{12}$ = .85, $r_{13}$ = .60, and $r_{23}$ = .45
(these are fictitious data). For this group

$$r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}} = \frac{.45 - (.85)(.60)}{\sqrt{(1 - .85^2)(1 - .60^2)}} = \frac{-.06}{.4214} = -.142$$

The significance of this partial correlation may be tested in the
same manner as a zero order coefficient, except that the effective
size of the sample is considered as one less than the actual size.
Hence we would test this partial r for a sample of 85 as if it were a
zero order r for a sample of 84. From Table 13, page 212, we see
that for a sample of 84 an r of -.142 is not significant even at the
5 per cent level, whereas for a sample of 85 an r of .45 is highly
significant. These data, then, are quite consistent with the
hypothesis that there is no intrinsic relationship between height
and spelling ability for the population involved.
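
The first order formula is easily checked with a short Python sketch. The function and argument names are invented for this illustration; the three zero order coefficients are the fictitious values of the example (age and height .85, age and spelling .60, height and spelling .45).

```python
from math import sqrt


def first_order_partial(r_xy, r_xz, r_yz):
    """Partial correlation between x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Height (2) and spelling (3) with age (1) taken out.
r_23_1 = first_order_partial(r_xy=0.45, r_xz=0.85, r_yz=0.60)
print(f"r_23.1 = {r_23_1:.3f}")
```

The result is about -.142: once age is taken out, the fictitious correlation of .45 between height and spelling ability all but disappears.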

We have already observed that the partial correlation between
height and spelling ability, with the effect of age eliminated, may be
experimentally determined by selecting a number of pupils of the
same age from the population and computing the correlation for
these pupils alone. The partial correlation computed from the
original age-variable sample by (47) may be considered as an
average of such zero order r's. Suppose we had a very large sample
of pupils varying in age from 6 to 16, and that we split this sample
into groups such that all pupils in each group were
within one-half year of the same age. That is, we would have
one group of six-year-olds, another of seven-year-olds, etc., up to
a group of sixteen-year-olds. We could then compute the
correlation between height and spelling ability for each of these age
groups. The average of these correlation coefficients would then
be essentially the same as the partial correlation between height
and spelling ability with age held constant, computed by (47)
from the entire sample (ages having been expressed in years at
the nearest birthday).


In a manner similar to that already explained, it is possible to
compute partial correlations with more than one variable held
constant. The general formula for a partial correlation coefficient
is

$$r_{12.34 \ldots n} = \frac{r_{12.34 \ldots (n-1)} - r_{1n.34 \ldots (n-1)}\, r_{2n.34 \ldots (n-1)}}{\sqrt{\left[1 - r_{1n.34 \ldots (n-1)}^2\right]\left[1 - r_{2n.34 \ldots (n-1)}^2\right]}} \tag{48}$$

In the case where two variables are held constant, this formula
becomes

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\, r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}$$

When three variables are held constant it becomes

$$r_{12.345} = \frac{r_{12.34} - r_{15.34}\, r_{25.34}}{\sqrt{(1 - r_{15.34}^2)(1 - r_{25.34}^2)}}$$

and so forth. (See page 257 for references on computational
procedures.)

In general, the order of a partial correlation coefficient is the 
number of variables held constant. The significance of any 
partial r from a random sample may be tested by means of Table 
17 by considering the effective size of the sample as reduced by 
one for each variable held constant. 
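
Because each partial coefficient is built from coefficients of the next lower order, the general formula lends itself to a recursive sketch in Python. The function name and the four-variable correlation matrix below are hypothetical and are given only to show the mechanics; variables are indexed from 0 in the code.

```python
from math import sqrt


def partial_r_general(R, i, j, controls=()):
    """Partial correlation between variables i and j with the variables in
    `controls` held constant, computed recursively from the matrix R of
    zero order correlations."""
    if not controls:
        return R[i][j]                      # zero order coefficient
    *rest, k = controls                     # remove variable k last
    r_ij = partial_r_general(R, i, j, rest)
    r_ik = partial_r_general(R, i, k, rest)
    r_jk = partial_r_general(R, j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


# Fictitious data of the age-height-spelling example (0 = age, 1 = height,
# 2 = spelling ability).
R3 = [[1.00, 0.85, 0.60],
      [0.85, 1.00, 0.45],
      [0.60, 0.45, 1.00]]
first_order = partial_r_general(R3, 1, 2, controls=(0,))

# Hypothetical four-variable matrix, used only to show a second order case.
R4 = [[1.00, 0.50, 0.40, 0.30],
      [0.50, 1.00, 0.35, 0.25],
      [0.40, 0.35, 1.00, 0.20],
      [0.30, 0.25, 0.20, 1.00]]
second_order = partial_r_general(R4, 0, 1, controls=(2, 3))
second_order_reversed = partial_r_general(R4, 0, 1, controls=(3, 2))

print(f"first order: {first_order:.3f}, second order: {second_order:.3f}")
```

The order in which the constant variables are removed does not affect the result, so the last two values agree. The order of the coefficient, and hence the reduction in effective sample size, is simply the number of entries in `controls`.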

The interpretation of partial correlation coefficients has caused
considerable difficulty in educational research. It should be
recalled particularly that a correlation coefficient tells nothing
whatever about the nature of the cause-and-effect connections
between variables. The fact that $r_{AB} \neq 0$ may mean that A is in
part caused by B, or that B is in part caused by A, or that each is
in part caused by the other, or that both are in part caused by
other and unspecified variables. The correlation coefficient in itself
provides no clue whatever as to which of these is the correct
explanation. The same is true of partial correlation coefficients.
The partial correlation $r_{AB.C}$ is essentially a zero order $r_{AB}$ for a
group in which those factors which would otherwise give rise to
variations in C are inoperative, but even if it is possible to identify
these factors, the problem of interpreting the remaining
correlation is of the same nature as before.

The interpretation of partial r's in educational research is
complicated further by the prevalence of the so-called “jingle
fallacy,” to which reference has been made earlier (page 234).
Partial r's computed for the scores on educational or psychological
tests are always highly ambiguous, and in most instances it is
practically impossible to arrive at a clear-cut interpretation of
them. Suppose, for example, that we compute $r_{AB.C}$ for a group
in which A represents the score on a reading test, B that on an
arithmetic test, and C that on a general intelligence test. In the
first place, we must recognize that the factor which is held constant
is only the score on the intelligence test. Because of the
unreliability of that test, there is still considerable uncontrolled
variability in whatever the test measures. In the second place, none
of the scores is a pure or perfectly valid measure, even aside from 
chance errors in measurement, of the ability implied in the title of 
the test, but is partly a measure of many irrelevant factors which 
may or may not influence the scores on the other tests as well. In 
the third place, a good measure of general intelligence is in part a 
measure of arithmetic ability and of some of the factors involved 
in reading. In “holding constant” intelligence we may therefore 
be holding constant more than we should. In the fourth place, we
are rarely able to provide a very meaningful and exact description
of what each test is intended to measure, to say nothing of what it
actually does measure. In the fifth place, the observed $r_{AB.C}$ is
subject to sampling errors, and may be very markedly influenced
by chance fluctuations in the zero order r's (see page 211). In fact,
partial r's cannot possibly be very meaningful unless computed
from much larger samples than have usually been employed for
this purpose in educational research. Finally, there are many
other factors, in addition to those measured by a general
intelligence test, which might in part account for the correlation between
the scores on the arithmetic and reading tests, and which have not
been held constant in this analysis. Under all of these conditions,
it would be foolhardy indeed to attempt to say much, on the basis
of partial correlations, about the cause-and-effect relationships
between the traits implied in the test titles. Unfortunately,
however, many such attempts have been made in educational research,
with results often obviously inconsistent with common-sense
considerations.


The foregoing is by no means intended as an adequate discussion 
of the use and interpretation of partial correlation techniques in 
educational research. Before attempting to make any use of these 
techniques, the student should familiarize himself with the con- 
tents of more exhaustive discussions of these techniques which are 
readily available in the literature of educational research. Among 
the best of these discussions are those in Chapter XIV of Statistics
in Psychology and Education, by Henry E. Garrett, in Chapter XI 
of The Scientific Study of Educational Problems, by W. S. Monroe 
and M. D. Engelhart, and in an article “On the Analysis of Causa- 
tion," by Jack W. Dunlap and E. E. Cureton, Journal of Educa- 
tional Psychology, December, 1930 (Volume XXI). 


16. MULTIPLE CORRELATION AND REGRESSION 

The student will recall, from his study of simple correlation the- 
ory, that whenever a linear relationship exists between two vari- 
ables, the regression equations for these variables may be used to 
estimate or predict values of one variable from known values of 
the other variable. The precision of this prediction depends upon 
the correlation between the variables, and is measured by the 
standard error of estimate. 

For example, suppose that for a sample of college freshmen, $X_1$
represents the grade-point average of a student, $X_2$ his score on an
entrance examination, $M_1$ the mean of the $X_1$'s, $M_2$ the mean of
the $X_2$'s, $\sigma_1$ and $\sigma_2$ the corresponding standard deviations, and
that $x_1 = X_1 - M_1$ and $x_2 = X_2 - M_2$. If $r_{12}$ is known, we can
then predict the grade-point average of any student from his
entrance examination score by means of the following regression
equation. In deviation form, this equation is

$$x_1' = r_{12} \frac{\sigma_1}{\sigma_2} x_2 \tag{49}$$

and in raw score form

$$X_1' = r_{12} \frac{\sigma_1}{\sigma_2} X_2 + \left(M_1 - r_{12} \frac{\sigma_1}{\sigma_2} M_2\right) \tag{50}$$

In these equations $x_1'$ and $X_1'$ are the estimated values of $x_1$ and $X_1$,
respectively. If such estimates were made for all students, then
$r_{x_1' x_1}$ would be the same as $r_{12}$. The standard error * of these
estimates would be given by

$$\sigma_{1.2} = \sigma_1 \sqrt{1 - r_{12}^2} \tag{51}$$

* This standard error is valid only for estimates made near the
mean, because of sampling errors in the regression coefficient. For a
discussion of this problem, see “The Application of the Theory of Error to the
Interpretation of Trends,” H. Working and H. Hotelling, Journal of American Statistical
Association (1929), Volume XXIV, pp. 73-85. This article also has important
implications for the use of the formula for the standard error of measurement as applied
to scores on educational tests.

If several entrance examinations had been given these students,
one could of course predict the first year grades in this fashion from
the scores on any one of the examinations, although the
examination to use would obviously be that for which the scores correlate
most highly with grades. However, by computing a composite of
the scores on all examinations for each student, and finding the
regression equation for estimating grades from these composites,
one might secure better estimates of grades than could be secured
from the scores on any one examination alone. Obviously, there is
an unlimited number of ways in which the scores on such
examinations may be combined. The simplest composite would be the
sum of the scores on the various examinations. In most instances,
however, a better composite could be secured by giving certain
scores more weight than others. The problem then is to find what
system of weights will produce a composite which will correlate
more highly with grades than will any other composite. An
approximate solution to this problem could be found empirically by
trying out each of a large number of arbitrary systems of weights,
actually computing a composite for each student according to each
system for a large number of students, and finding which composite
correlated most highly with grades. Fortunately, this problem
may be more conveniently and satisfactorily solved by the methods
of multiple correlation. By these methods, it is possible to compute
directly the equation which gives the best possible linear
combination of a number of (independent) variables for the purpose of
predicting another (dependent) variable. This equation is known
as the multiple regression equation. The first order multiple
regression equation (for two independent variables) is expressed in
deviation form as

$$x_1' = r_{12.3} \frac{\sigma_{1.3}}{\sigma_{2.3}} x_2 + r_{13.2} \frac{\sigma_{1.2}}{\sigma_{3.2}} x_3 \tag{52}$$

in which $r_{12.3}$ and $r_{13.2}$ are the partial correlation coefficients
computed as in (47), in which $\sigma_{1.3}$, $\sigma_{2.3}$, $\sigma_{1.2}$, and $\sigma_{3.2}$ are the partial $\sigma$'s
computed as in (51), $x_1'$ is the estimated $x_1$, and $x_2$ and $x_3$ are the
known deviations of the values of the independent variables from
their respective means.

The correlation ($R_{1.23}$) between these estimates ($x_1'$) and the
actual values of $x_1$ is known as the multiple correlation coefficient,
that is, $R_{1.23} = r_{x_1' x_1}$. This multiple correlation coefficient may be
computed ¹ from the zero order r's or first order partial r's by

$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} \tag{53}$$

or

$$R_{1.23} = \sqrt{1 - (1 - r_{12}^2)(1 - r_{13.2}^2)} \tag{54}$$

The standard error ² of estimates based on a first order multiple
regression equation is given by

$$\sigma_{1.23} = \sigma_1 \sqrt{1 - r_{12}^2}\, \sqrt{1 - r_{13.2}^2} \tag{55}$$

The raw score form of the first order multiple regression equation
is

$$X_1' = r_{12.3} \frac{\sigma_{1.3}}{\sigma_{2.3}} X_2 + r_{13.2} \frac{\sigma_{1.2}}{\sigma_{3.2}} X_3 + \left(M_1 - r_{12.3} \frac{\sigma_{1.3}}{\sigma_{2.3}} M_2 - r_{13.2} \frac{\sigma_{1.2}}{\sigma_{3.2}} M_3\right) \tag{56}$$

To illustrate the use of these first order equations in the
prediction-of-grades situation previously referred to, suppose that
variable 1 represents grade-point averages, that variables 2 and 3 are
scores on two entrance examinations (which we will call Test 2 and
Test 3), and that for a given sample of freshmen

$M_1$ = 2.10      $\sigma_1$ = 1.16      $r_{12}$ = .650
$M_2$ = 48.70     $\sigma_2$ = 11.15     $r_{13}$ = .524
$M_3$ = 62.05     $\sigma_3$ = 15.25     $r_{23}$ = .475

¹ When only two independent variables are involved, and only $R_{1.23}$ and the multiple
regression equation are needed (that is, when the partial r's and $\sigma$'s are not needed),
a simpler method of computation is that suggested on pages 206 to 207.

² See footnote on page 253.
Then

$$r_{12.3} = \frac{.650 - (.524)(.475)}{\sqrt{(1 - .524^2)(1 - .475^2)}} = .535$$

and

$$r_{13.2} = \frac{.524 - (.650)(.475)}{\sqrt{(1 - .650^2)(1 - .475^2)}} = .322$$

$$\sigma_{1.3} = \sigma_1 \sqrt{1 - r_{13}^2} = 0.988$$
$$\sigma_{2.3} = \sigma_2 \sqrt{1 - r_{23}^2} = 9.812$$
$$\sigma_{1.2} = \sigma_1 \sqrt{1 - r_{12}^2} = 0.882$$
$$\sigma_{3.2} = \sigma_3 \sqrt{1 - r_{23}^2} = 13.420$$

from which the multiple regression equation in deviation form is

$$x_1' = .535\, \frac{0.988}{9.812}\, x_2 + .322\, \frac{0.882}{13.420}\, x_3 = 0.05387\, x_2 + 0.02116\, x_3$$

or, in raw score form, is

$$X_1' = 0.05387\, X_2 + 0.02116\, X_3 - 1.8365$$

The multiple correlation $R_{1.23}$ in this case is

$$R_{1.23} = \sqrt{1 - (1 - .650^2)(1 - .322^2)} = .6945$$

and the corresponding standard error of estimate is

$$\sigma_{1.23} = 1.16 \sqrt{1 - .650^2}\, \sqrt{1 - .322^2} = 0.8345$$

If, then, a certain student made a score of 58 on Test 2 and of 70 on
Test 3, our estimate of his grade-point average (“Test 1”) would
be 2.77, with a standard error of 0.835. We might then be
reasonably sure that his actual grade-point average will lie within two
standard errors of the estimate, or between 1.10 and 4.44.
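
The whole first order computation can be retraced with a Python sketch. The function name and the layout of its return value are invented for this illustration; the means, standard deviations, and zero order r's are those of the example, with .475 taken as the correlation between the two tests. Because the sketch carries full precision rather than rounding the partial r's to three places, its results differ from the printed figures in the last decimal place or two.

```python
from math import sqrt


def first_order_multiple_regression(means, sds, r12, r13, r23):
    """Weights, constant, multiple R and standard error of estimate for
    predicting variable 1 from variables 2 and 3."""
    m1, m2, m3 = means
    s1, s2, s3 = sds
    # First order partial correlations, as in (47).
    r12_3 = (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    r13_2 = (r13 - r12 * r23) / sqrt((1 - r12 ** 2) * (1 - r23 ** 2))
    # Partial standard deviations, as in (51).
    s1_3 = s1 * sqrt(1 - r13 ** 2)
    s2_3 = s2 * sqrt(1 - r23 ** 2)
    s1_2 = s1 * sqrt(1 - r12 ** 2)
    s3_2 = s3 * sqrt(1 - r23 ** 2)
    # Regression weights and constant term of the raw score equation.
    b2 = r12_3 * s1_3 / s2_3
    b3 = r13_2 * s1_2 / s3_2
    constant = m1 - b2 * m2 - b3 * m3
    # Multiple correlation and standard error of estimate.
    multiple_r = sqrt(1 - (1 - r12 ** 2) * (1 - r13_2 ** 2))
    std_error = s1 * sqrt(1 - r12 ** 2) * sqrt(1 - r13_2 ** 2)
    return b2, b3, constant, multiple_r, std_error


b2, b3, constant, multiple_r, std_error = first_order_multiple_regression(
    means=(2.10, 48.70, 62.05),
    sds=(1.16, 11.15, 15.25),
    r12=0.650, r13=0.524, r23=0.475,
)

# For illustration: scores of 58 on Test 2 and 70 on Test 3.
estimate = b2 * 58 + b3 * 70 + constant
print(f"X1' = {b2:.5f} X2 + {b3:.5f} X3 + ({constant:.4f})")
print(f"R = {multiple_r:.4f}, standard error = {std_error:.4f}")
print(f"estimate = {estimate:.2f}")
```

With these figures the weights come out near 0.0539 and 0.0211, the multiple R near .694, and the standard error of estimate near 0.835; scores of 58 and 70 give an estimated grade-point average of about 2.77.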


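The general forms of the preceding equations, for any number
($n$) of variables, are

$$x_1' = r_{12.34 \ldots n} \frac{\sigma_{1.34 \ldots n}}{\sigma_{2.34 \ldots n}} x_2 + r_{13.24 \ldots n} \frac{\sigma_{1.24 \ldots n}}{\sigma_{3.24 \ldots n}} x_3 + \cdots + r_{1n.23 \ldots (n-1)} \frac{\sigma_{1.23 \ldots (n-1)}}{\sigma_{n.23 \ldots (n-1)}} x_n \tag{57}$$

$$\sigma_{1.234 \ldots n} = \sigma_1 \sqrt{1 - r_{12}^2}\, \sqrt{1 - r_{13.2}^2}\, \sqrt{1 - r_{14.23}^2} \cdots \sqrt{1 - r_{1n.23 \ldots (n-1)}^2} \tag{58}$$

$$R_{1.234 \ldots n} = \sqrt{1 - \left[(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \cdots (1 - r_{1n.23 \ldots (n-1)}^2)\right]} \tag{59}$$
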


The computational work involved in a multiple-correlation 
problem for more than three or four variables is obviously very 
laborious and time-consuming. Excellent outlines of preferred 
computational procedures are given on pages 120-124 of Workbook
in Statistical Method, by Jack W. Dunlap (Prentice-Hall, Inc.,
1939). The same reference also presents (page 127) an excellent 
bibliography on the methodology of multiple correlation. Other 
excellent discussions of the computational procedures in multiple 
correlation analysis are given in Statistics in Psychology and Educa- 
tion, Henry E. Garrett, pp. 409-460, and in Statistical Methods for 
Students of Education, Karl J. Holzinger, pp. 283-315. 

No attempt is made here to present any adequate discussion of 
the use and interpretation of partial and multiple correlation tech- 
niques in educational research. The primary purpose of this book 
has been to make more readily available to students of education 
only those relatively recent developments in statistical theory 
which have not yet been adequately treated in educational texts in
statistics or with specific reference to educational applications.
The uses and interpretation of partial and multiple correlation
techniques have been very adequately treated in the texts by
Garrett and Holzinger, and in The Scientific Study of Educational
Problems, by Monroe and Engelhart, pages 323-389. The student
who is interested in these techniques or hopes to make any applica- 
tion of them is strongly urged to become familiar with these 
references.

TABLE 18
TABLE OF RANDOM NUMBERS ¹


   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


I 03 47 43 73 86 36 96 47 36 бі 46 98 63 71 62 33 26 16 80 45 
2 97 74 24 67 62 42 Зі 14 57 20 42 53 32 37 32 27 07 36 07 SI 
з 16 76 62 27 66 56 so 26 71 оу 32 90 79 78 5з 13 ss 38 58 59 
4 12 56 85 99 26 96 96 68 27 SI 05 03 52: 03 T8 57 KE 10 х 4I 
5 55 59 56 35 64 38 54 82 46 21 зі 62 43 до до об 18 44 32 53 
6 16 22 77 94 39 49 54 43 54 82 17 37 оз 23 78 87 35 20 96 43 
7 84 42 17 53 31 57 24 ss об 88 77 O4 74 47 б) 2r 76 33 so 25 
8 бз от 63 78 so 16 9s ss 57 19 98 то зо 71 75 12 86 73 58 07 
9 33 2I 12 34 29 78 64 56 o; 82 52 42 07 44 38 15 sr oo i13 42 
10 57 60 86 32 44 009 47 27 96 s4 ай ту 46 o9 62 оо s2 84 77 27 
II 18 18 оў 92 46 44 17 16 58 o9 79 83 86 19 62 об 76 so o3 ТО 
12 26 62 38 97 75 84 16 оў 44 99 83 Іг 46 32 24 20 14 85 88 45 
Ss 39 $290 "4 " В or 77) ор Br ор ae xe dà o8 32 98 94 07 72 
14 52 36 28 19 95 бо 92 26 тї 97 Оо 56 76 зт 38 80 22 o2 53 53 
IS 37 85 94 35 12 83 39 so o8 зо да 34 о] 96 88 54 42 об 87 98 
16 70 29 17 12 I3 40 33 20 38 26 13 89 SI оз 74 17 76 37 13 94 
I7 56 ба 18 37 35 96 83 so 87 75 оу 35 ag 93 47 7о 33 24 O3 54 
18 99 49 57 22 77 88 42 95 45 72 16 64 36 16 oo ол 43 18 66 79 
I9 10 or 1504 74. 33 27 14 GE od ш gg 34 68 49 12 72 оў 34 45 
20 31 16 93 32 43 so 27 89 87 19 ж IS 37 00 49 52 85 66 60 44 
21 68 34 30 13 јо 55 74 30 7; 40 44 22 78 84 26 o4 33 36 09 52 
22 74 57 25 65 76 So 29 97 68 бо т or 38 67 54 13 58 18 25 27 
aS 47042 3? 00 АЗ 48 55 0o 05 £s 96 45 60 36 10 96 46 92 42 45 
24 Оо 39 68 29 бб 66 37 32 20 30 77 84 57 o3 29 1o 45 65 04 26 
25 29 94 98 94 24 68 49 69 то 82 S3 75 От 93 30 34 25 20 57 ?7 
26 16 до 82 66 so 83 62 64 rir 12 67 19 oo 71 74 бо 47 21 29 68 
27 IF 27 94 75 06 об оо 19 74 66 o2 ол 37 34 o2 76 70 оо зо 86 
28 35 24 Іо 16 ао 33 32 5I 26 38 79 78 45 O4 І 16 92 53 56 16 
79. 35 93 16 86 39 44 38 or or so 87 95 56 Er 4I . до ог 74 от 62 
39 EE 99'8€ OF 47 06 a4 Ws 45 Ex Cai gh Xa cs or од 52 43 48 85 
зт 66 67 до 67 14 64 os 71 95 86 II os 65 og 68 76 83 20 37 90 
34 ЧА 990 94 4S 1b 05 53 88 оу до $a dy 41 14 86 22 98 12 22 08 
33 68 05 бі 18 oo 33 96 оз 74 19 oy бо 62 93 SS 59 33 82 43 90 
34 20 46 78 73 до оў sr до 14 оз 94 02 33 31 o8 39 54 16 49 36 | 
35 64 19 58 97 79 15 o6 15 93 20 от 90 то 75 06 до 78 78 89 62 | 
т SOM DOM Фу айк £8 T» Wa ke dx xp sy o rye тас 3s 
37 07 97 10 88 23 оо 98 42 до 64 бі 71 62 99 об бі 29 16 93 15 
04.2 1 8117: 2. TTTM "Qc | 
39 14 65 52 68 74 87 37 48 22 аг 26 78 63 об 55 13 o8 a7 or so | 
40 17 S3 77 58 JI 71 S9 36 50 72 I2 4I 94 96 26 44 95 27 36 99 | 
4! 90 26 59 21 I9 23 д бі 33 ra 96 93 o2 18 39 оў o2 18 36 07

¹ Taken by consent from Table 33, Statistical Tables for Biological, Agricultural
and Medical Research, R. A. Fisher and F. Yates, by permission of Oliver and Boyd.

TABLE 18 (continued)


   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
74 23 99 67 бі 32 28 69 84 o4 62 67 86 24 98 33 a 19 б 
38 o6 86 54 99 oo 65 26 94 o2 82 go 23 07 79 62 67 80 бо 
3o 68 21 46 об 72 17 IO 94 25 3 3i 74 90 40 28 34 бо 49 
43 36 92 бо 65 sr 18 37 88 бі 38 44 12 45 32 92 84 88 65 
25 37 55 26 от gr 51 бб 40 74 7:1 12 94 OF 24 O2 jr 37 OF 
63 зі 17 бо 71 so 8o 39 56 38 15 40 тїт 48 43 40 45 86 98 
55 22 21 82 48 22 28 o6 oo бї 64 їз 54 ot 82 78 r2 23 20 
07 26 13 80 ог о 07 82 од 59 63 69 36 o3 бо тт 15 83 8o 
54 16 24 15 sr S4 44 82 oo 62 бї 65 o4 бо 38 18 65 18 97 
85 27 84 87 бі 48 ба 56 26 до 18 48 13 26 37 70 15 42 57 
92 18 27 46 57 99 16 96 56 30 33 72 85 22 84 64 38 56 98 
30 27 50 37 62 75 41 66 48 86 97 80 бі 45 23 53 ол or 63 
45 93 IS 22 бо 21 75 46 or 98 77 27 85 42 28 88 GI o8 84 
O8 ss 18 40 45 44 74 I3 90 24 94 96 бт o2 57 ss 66 83 15 
85 89 95 66 sr о 19 34 88 15 84 97 19 75 12 76 39 43 78 
84 71 14 35 то тї 58 49 26 бо Ir 17 17 76 86 зї 57 20 18 
78 28 16 84 13 52 53 94 S3 75 45 бо зо 96 73 89 65 70 31 
17 75 Gs s» 28 до 19 72 12 25 12 74 75 67 бо до бо I r9 
76 28 i2 54 22 or тї 94 25 71 96 16 16 88 68 64 36 74 45 
31 67 72 30 24 o2 94 o8 63 38 32 36 66 ог бо 36 38 25 39 
44 66 44 21 66 об 58 o4 62 68 15 54 35 o2 42 36 48 96 32 
66 22 15 86 26 63 74 41 99 58 42 36 62 24 58 37 52 18 sr 
24 до 14 SI 23 22 30 88 s; 95 67 47 29 83 94 бо до об оў 
73 ох бї 19 бо 20 72 93 48 98 57 оў 23 69 65 95 39 69 $$ 
бо 73 99 84 43 80 94 36 34 56 бо 47 оў 4% оо 22 9I 07 12 
37 90 61 56 70 то 23 98 os 85 II 34 76 бо 76 48 45 34 60 
67 то o8 23 08 оз 35 08 86 99 29 76 29 81 33 34 9I 58 93 
28 so o; 48 89 ба 58 89 75 83 85 62 27 89 зо 14 78 56 27 
15 83 87 бо 79 24 3r 66 56 21 48 24 об 93 от 98 94 05 49 
19 68 97 65 оз 73 52 16 56 оо ss 55 до 27 33 42 29 38 87 
81 29 13 зо 35 or 20 71 34 62 33 74 82 14 53 73 10 00 9% 
86 32 68 оз 33 08 74 66 99 40 14 71 94 58 35 94 19 38 81 
От 70 29 їз 80 оз 54 07 27 96 94 78 32 66 so 95 52 74 33 
71 67 gs i3 20 o2 77 95 94 64 85 o4 05 72 OI 32 90 76 14 
66 тз 83 27 оз 79 64 64 72 28 54 96 53 84 48 14 52 98 94 
96 o8 4s 64 13 os oo лт 84 93 оў S4 72 50 21 45 57 09 77 
83 43 48 36 оз 88 33 69 96 72 36 o4 19 76 47 45 15 18 60 
бо уг 62 46 до 80 8r зо 37 34 39 23 04 38 25 15 35 7I 39 
17 so 88 ух 44 от 14 88 47 89 23 30 63 15 56 34 20 47 89 
69 10 бї 78 71 32 76 95 62 87 oo 22 s8 40 92 54. or 7$ 25 
93 36 47 83 56 20 14 82 ІІ 74 21 97 99 65 96 42 68 63 96 
30 92 29 оз об 28 81: 39 38 62 25 06 84 63 бі 29 o8 93 67 
29 so ro 34 зі s; 75 95 8o 51 97 оз 74 77 76 15 58 40 44 
3I 38 86 24 37 79 8I 53 74 73 24 16 10 33 52 83 оо 94 7 
OI 23 87 88 58 o2 39 37 67 42 10 14 20 92 16 55 23 42 45 
33 95 22 oo 18 74 92 oo 18 38 79 58 69 32 8: 76 80 ^6 02 
84 бо 79 80 24 36 so 87 38 82 o7 53 80 35 96 35 25 79 : 
49 62 98 82 54 оу 20 45 95 15 74 80 o8 32 I6 46 70 us o 
31 89 оз 43 38 36 92 68 72 32 14 82 99 79 8o 60 47 Z 97 
59 73 os so o8 22 23 71 77 91 OF 93 20 49 82 96 59 20 94

INDEX


Analysis of covariance, see Covariance 

Analysis of variance, see Variance 

Attenuation, correlation coefficients cor- 
rected for, 233-235 


Biserial correlation, 241-243 


χ², 30 ff.; sampling distribution of, 34-37;
table of, 36; tests of goodness of fit,
37-41; tests of independence, 41-43;
tests of homogeneity, 43-46; combining
probabilities, 46-47

Combining correlations from several
samples, 218-219

Combining probabilities, 46-47

Confidence, levels of, 13-15

Correlation, measures of, 208 ff.; correlation
ratio, 239-241; biserial correlation,
241-243; tetrachoric correlation, 243-246;
rank correlation, 246-248

Correlation ratio, 239-241

Covariance, 180 ff.; nature of, 180-183;
analysis of in simple methods experiments,
191-196; computational procedure
in simple methods experiments,
195-196; analysis of in duplicated
experiments, 196-203; computational
procedure in duplicated experiments,
197-203; applications of, 204 ff.


Degrees of freedom, 33-34

Design, experimental, 80-84; factorial,
163 ff.

Disproportionate class numbers, 152


Estimation, of true variance, 48-50; of
true mean from stratified sample, 161

Experimental designs, 80-84


F (variance ratio), 60 ff.; table for, 62-65

Factorial designs, 163 ff.; analysis of
duplicated factorial designs, 173 ff.


Homogeneity of correlations, 219 ff.

Homogeneity of variance, assumption of,
132 ff.


Interactions, 110-113, 117-118, 150, 166,
173


Limitations of large sample theory, 18-21 


Linearity of regression, test of, 235-238 
Multiple correlation, 253-257 
Null hypothesis, 15-17 


Parameter, 8 

Partial correlation, 248-253 

Population, definition of, 2; finite, 2-3; 
infinite, 2-3; real, 3; hypothetical, 3 

Precision, 10, 76-78 


Random numbers, 25; table of, 262- 
264 

Random selection, 24-29 

Randomized groups, 151 

Rank correlation, 246-248 


Sample, definition of, 2; random, 3-4; 
biased, 4-5; stratified, 5; controlled, 
5-7; representative, 6; matched, 6-7 

Sampling distribution, 8; of t, 51-54; of
χ², 34-37

Sampling in educational research, 21-24

Significance, tests of, 15-17; levels of, 
16-17; of the mean of a small sample, 
54-55; of a difference in the means of 
independent small samples, 56-58; of 
а difference in the means of related 
measures, 58-59; of a difference in 
variability for small samples, 60-66; 
of the mean of a sample consisting of 
randomly selected, intact groups, 66- 
72; of a difference in means for samples 
each of which consists of relatively 
homogeneous subgroups, 72-75; of a
product-moment correlation coefficient,
210; of a difference between r's from
independent random samples, 214 ff.; 
of a difference between r's from related 
samples, 217-218; of a non-linear rela- 
tionship, 238-239; heterogeneous vari- 
ance, effect upon F-tests, 139-144 

Standard error, 8-9; of mean of a large 
random sample, 12; of mean of a con- 
trolled sample, 158 ff.; of mean of a 
stratified sample, 161 ff.; of biserial r,
243

Statistic, 8 


t, sampling distribution of, 51-54; table
of, 53

Tables, of χ², 36; of t, 53; for F, 62-65;
significant values of r for various sized
samples, 212; values of z for various
values of r, 215; values of z for values
of p, 243; normal probability, 261;
random numbers, 262-264 

Testing hypotheses, 10-15 

Tests, of goodness of fit, 37-41; of inde- 
pendence, 41-43; of homogeneity, 43- 
46; of homogeneity of variance, 99; of 
significance, see Significance 

Tetrachoric correlation, 243-246 


Validity of a test n times as long as a given
test, 228-233 

Variance, analysis of, 84, 87 ff.; analysis
into two components (analysis of results
of simple methods experiments), 93-99;
computational procedure for analysis 
into two components, 100-104; homo- 
geneity of variance, 99, 132 ff.; analysis 
into three components (analysis of 
pooled results of duplicated experiments),
104-114; interactions, interpretation
of, 110-113, 117-118, 150,
166, 173; analysis into four components 
(analysis of pooled results of duplicated 
experiments), 114-119; computational 
procedure for analysis into four components,
119-127; applications of, 127 ff.,
145 ff.; homogeneity of variance,
assumption of, 132 ff.; randomized
groups, 151; disproportionate class
numbers, 152; factorial designs, 163 ff.


z (function of r), 212; values of z for various
values of r, 215